Replies: 20 comments 5 replies
-
@dehlirious Regarding your earlier pull request, I have some concerns, which I commented on in the request; I'm just awaiting feedback on those. Otherwise, I'm happy to merge 😄 -- Below I have reformatted the code for GitHub readability and renamed functions and variables from snake_case to camelCase.
Alternative Solution
Call
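A hedged reconstruction of the reformatted snippet, assuming it simply restyles the count_method_occurrences() helper and its call from the original issue (further down) into camelCase:

```php
// Hedged reconstruction: assumes the restyling only renames the
// count_method_occurrences() helper and its call to camelCase.
public function countMethodOccurrences($methodName)
{
    $backtrace = debug_backtrace();
    $count = 0;

    foreach ($backtrace as $trace) {
        // Count how many frames on the current call stack belong to this method.
        if (isset($trace['class'], $trace['function']) && $trace['function'] === $methodName) {
            $count++;
        }
    }

    return $count;
}

// Call
if ($this->countMethodOccurrences('followLinks') < 12) {
    // ... existing foreach over the crawled links ...
}
```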
-
Regarding that pull request: ignore it, I didn't realize it included several unrelated commits. I'm a fan of the restyling/naming. The only thing I'll mention is that I think the function could still be useful in future, but I'm just using this version myself. Also, I can't seem to get my code to appear formatted correctly on my phone.
On a side note, I have made a lot of changes, and I'll be sure to try and share the useful/beneficial ones before I (attempt to) rewrite it and use Solr instead of MySQL.
-
Check your last comment, which I edited: you need to use triple backticks (on separate lines) to enclose the code. See Creating and highlighting code blocks. Completely understand; I am in the process of rewriting an old project I wrote from snake_case to camelCase. A consistent convention is all that matters. Sure, I appreciate you pushing updates. If you have a public branch on your fork, I can always check changes and new features there. I presume you're using Apache Solr for full-text search, in which case I might suggest Redis. The reason I suggest Redis is that, besides being a multi-threaded cache with full-text search, it also has the benefit of being a database which supports replicas and edge-case scenarios (such as those offered by Kubernetes).
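For reference, Redis's full-text search comes from the RediSearch module (FT.CREATE / FT.SEARCH). A minimal sketch of indexing and querying from PHP, assuming the phpredis extension and a RediSearch-enabled server; the index name, key prefix and field names are illustrative only:

```php
<?php
// Minimal sketch, assuming phpredis and a Redis server with the RediSearch
// module loaded. Index, prefix and field names are illustrative only.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// One-off setup: create a full-text index over hashes whose keys start with "site:".
$redis->rawCommand(
    'FT.CREATE', 'idx:sites',
    'ON', 'HASH', 'PREFIX', '1', 'site:',
    'SCHEMA', 'title', 'TEXT', 'description', 'TEXT'
);

// Store a document as a plain hash; RediSearch indexes it automatically.
$redis->hMSet('site:1', [
    'title'       => 'ChatGPT explained',
    'description' => 'An overview of large language models',
]);

// Full-text query returning the first 20 matches.
$results = $redis->rawCommand('FT.SEARCH', 'idx:sites', 'chatgpt', 'LIMIT', '0', '20');
var_dump($results);
```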
-
Just released the changes. Check them out and let me know what you think. EDIT: check it out live at https://zrr.us/doogle/search.php?term=chatgpt&type=sites
"I presume you're using Apache Solr for full-text search"
-
From what I've had a chance to look at so far, I like the change. I couldn't help but notice that in crawl-manual.php you've placed a couple of Udemy courses. Would I be correct to assume you want to integrate ChatGPT and implement NLP? At scale, I can begin to see the performance limitations of using PHP and MySQL. A possible solution for the server-side language without a complete rewrite could be WebAssembly (WASM), as PHP is compatible. Alternatively, C++ or Go if a rewrite is acceptable, as either compiles into machine code. Do you intend to make your Python backend, or the entire source code, open-source?
-
For the URLs, I quite simply asked ChatGPT to generate me 50-100 URLs based on a few categories, and AI/ChatGPT was one of them. I haven't tried to use ChatGPT for processing the URLs' data and phrases; I think it'd be way too expensive for me right now, but I was planning on having a ChatGPT response alongside all the results. I am currently using an NLP library for processing the URLs, although at the moment I'm not using it to its fullest capacity.
Also, I stopped processing the unprocessed "sentences" column and reduced load time from 40s to 4-8s, and with Redis caching implemented, to a 200ms load time after the initial search. However, if you look up a basic word like "a" with more than 200-300k results, it is still quite slow on the initial search. Even though the MySQL query says "limit 0-20", I feel MySQL is still looking through the entirety of the database even after it's found its 20 matches. Unsure if that is the case, but performance-wise that is what it feels like. I'm going to go through a few different options; my next test is going to be with Solr, and I'll see the performance difference.
And I could release the Python file, but I don't know if it's done. Sometimes I'll write code and find out there was a more efficient way all along. Maybe I don't need the Python spaCy NLP library and I can do it in PHP, or use a less advanced PHP NLP library and achieve the same or similar result. With the current setup, I was able to process ~250k URLs an hour by the time I fine-tuned it. Currently I'm just rewriting more base stuff; right now I've made every submit button fetch the URL via Ajax, cleaning up my solutions and removing unnecessary code.
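As a side note on the Redis caching mentioned above, a rough sketch of the cache-aside pattern that gives those ~200ms repeat searches, assuming phpredis and PDO; the table, column and function names are illustrative, not Doogle's actual code:

```php
<?php
// Hedged sketch of cache-aside search caching, assuming phpredis and PDO.
// Table, column and function names are illustrative, and MATCH ... AGAINST
// assumes a FULLTEXT index on (title, description).
function searchSites(PDO $db, Redis $redis, string $term, int $page = 0): array
{
    $cacheKey = 'search:' . md5($term . ':' . $page);

    // Serve repeat searches straight from Redis.
    $cached = $redis->get($cacheKey);
    if ($cached !== false) {
        return json_decode($cached, true);
    }

    // First search of this term/page falls through to MySQL.
    $offset = $page * 20;
    $stmt = $db->prepare(
        "SELECT title, url, description FROM sites
         WHERE MATCH(title, description) AGAINST (:term)
         LIMIT $offset, 20"
    );
    $stmt->execute([':term' => $term]);
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // Cache for 10 minutes so identical searches skip MySQL entirely.
    $redis->setex($cacheKey, 600, json_encode($results));

    return $results;
}
```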
-
I see, that explains it then. I completely understand the need for test data. I personally used my website, as it has hyperlinks to other websites like GitHub, LinkedIn, StackOverflow and Twitter. Bleeping Computer and BBC were other good resources for indexing. But please keep in mind that your bot could be treated as a DoS and banned. I personally am hosting this on a LAN and a VPS so that I won't incur additional fees. Are you using a cloud provider like AWS, Azure or GCP, where you pay for the resources you use?
I don't think I understand; can you explain 'I stopped processing the unprocessed "sentences" column'? From the MySQL documentation, the LIMIT query optimisation is explained to prefer a 'full table scan' (in my opinion this wording is ambiguous for documentation). However, the documentation does seem to confirm your suspicion 🤔 (a quick EXPLAIN check is sketched at the end of this comment).
Remember that the full-text search feature Apache Solr offers is also offered by Redis. Speaking of Redis and caching servers, Memcached is an option if you were unaware. But again, since Redis 6 added multi-threading support and Redis has native replica support, there is little reason to have such a complex technology stack.
I have done that enough myself, feeling a solution wasn't ready. From my experience, just push to GitHub (even if it isn't properly working) and go from there. Otherwise, the solution never feels ready for the public's eyes. My exception to this was when doing malware development, which I was very cautious about before making any of the repos public. But the community generally accepts that everything is always a 'work in progress' and tends not to criticise harshly.
Strictly speaking, the search engine and web crawler portions are separate, so it would be completely acceptable if you programmed each using a different language. Take for instance the modern web browser setup: the client side uses JavaScript while the server side can use Node (JavaScript), PHP, Python, etc. Hence, I would argue mixing technology stacks isn't necessarily a bad thing; just consider the complexity of maintaining it and how others may perceive it (because it's open-source). I am questioning whether Doogle's backend would benefit from a migration to another language and database, though 🤔 I'm just not sure if I have the time to undertake such a project at the moment!
That Ajax solution will be a much-needed improvement to usability.
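One way to test the LIMIT suspicion above rather than guess is to ask MySQL for its query plan; a small sketch, with an illustrative table name:

```php
<?php
// Hedged sketch: use EXPLAIN to see whether MySQL expects to scan the whole
// table despite the LIMIT clause. Table and column names are illustrative.
$db = new PDO('mysql:host=127.0.0.1;dbname=doogle;charset=utf8mb4', 'user', 'pass');

$stmt = $db->query(
    "EXPLAIN SELECT title, url
     FROM sites
     WHERE description LIKE '%chatgpt%'
     LIMIT 0, 20"
);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // type = ALL plus a large 'rows' estimate means a full table scan;
    // the LIMIT only trims the rows returned, not the rows examined.
    echo $row['type'] . ' / rows examined (estimate): ' . $row['rows'] . PHP_EOL;
}
```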
-
I currently have a dedicated server that I use for all projects. Sorry if what I said was lacking context: "sentences" is just the indexed URL's plaintext. I was seeing what the difference between the descriptions column and the sentences column was during initial testing; when it came to results, sentences isn't a good source and resulted in a lot of false positives, so I stopped pulling results from the sentences column, and it reduced load time significantly. And I looked into the MySQL thing when I first had the suspicion and got the same answer as you. Seems it's just the way it works. I'll think about that. Even if it was just a column or two stored in Redis/Memcached, it could speed up the MySQL query for sure. You're definitely right, there's no need for a complex stack, but I'm willing to test things out. Ultimately, I have to write the code for any implementation, and as I work through that I'll get a better understanding of what exactly I need/want and what I can do. For example: "... Otherwise, the solution never feels ready for the public's eyes". Personally, I'm not moving away from PHP unless there's a critical reason to. Maybe to HHVM, though. I'll have to reach out on Twitter sometime.
-
Afraid that's a coincidence. Web crawlers are known to create lots of traffic from a single IP address, which in the context I was speaking about can result in some type of ban. IP address bans are the easiest, but often the most problematic due to dynamic IP addresses. Is your dedicated server using separate environments for each project, such as virtualisation (QEMU or VirtualBox) or containerisation (Docker/Podman)?
I see, I completely understand that choice. When developing I was just using as much data as possible, because I was learning how web crawlers worked. But now I realise the most useful data was the domain names, titles and keywords for site searches. Naturally, image searches are another story, and being happy with that solution I never felt the need to implement shopping or video searching.
Quite unfortunate about that MySQL limitation, and because of how different each SQL implementation is, using the documentation for Microsoft SQL, PostgreSQL or SQLite would not be viable! I figured as much. Apache Solr has quite a tedious syntax to write. But I'm not necessarily saying Redis is any easier; I will need to look into that solution further myself.
Twitter is a good place for consultation. If you message me on Twitter and don't get a reply, it's more likely I haven't seen your message.
-
Well, pre-8pm EST the website was still flagged as "deceptive", but I just checked and it's gone! No longer labelled as deceptive or on a phishing list. Nothing.
I'm confused about that, because I went to the same URL and it is just revealing the IP used to make pocketproxy requests, not my personal IP. As far as I'm aware, I don't pass HTTP_X_FORWARDED_FOR (using Firefox and Chrome). I use Docker the most, but not always, which maybe is a habit that could be amended. I don't currently have a need for virtualization like VirtualBox at my scale; if I had to scale up, absolutely something of that nature. If you're asking whether it's going through another proxy, no. I absolutely enjoy the image search functionality; it works quite well, and I haven't made a single change. Adding shopping sounds like it could be an ache overall but well worthwhile once the time for it becomes available. Solr is fun/tedious. It is doable though! I've got it mostly working already. But I'm just going to figure out a way to make the MySQL version better for now, anyway. Sounds good!
-
I see zrr.us is now listed as safe, as of 5pm GMT, so congrats on the successful appeal! HTTP_X_FORWARDED_FOR is being requested here. If you're that involved with Docker, I would suggest looking at Podman as a more secure alternative. But I am still relatively new to the containerisation scene, so it is just a suggestion. I am still a bit old school and rely heavily on virtualisation, but I am gradually transitioning where appropriate. The masonry grid layout is nice, but the performance of the framework isn't great on smartphones. Otherwise, it was a good design choice, as it saved a lot of development time. When I have some time I will send you some relational database suggestions for improving query time. But Amazon Aurora is a good place to look at architecture choices, as it claims to be 7x faster than MySQL server.
-
Initially, I thought so too! However, upon closer inspection, the getUserIp() function is utilized solely within the logcbl() function, which is a logging method employed exclusively in one snippet around line 510.
I'm open to giving Podman a try; it isn't the first time I've heard of it. And I believe virtualization is the right approach, so keep to it when possible, I'd suggest. Good to know about the framework. And you're right, it should be relatively simple to implement. I'll probably try multiple options, so I'm happy to hear more suggestions. As it turns out, using a leading wildcard in a MySQL search query disregards all indexes (including full-text ones) and results in a full table scan.
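A small illustration of that point, with made-up table and column names; the leading-wildcard LIKE forces a scan, while a trailing wildcard or a FULLTEXT MATCH can use an index:

```php
<?php
// Hedged illustration of why a leading wildcard defeats indexes.
// Table and column names are made up for the example.

// 1) Leading wildcard: MySQL cannot use a B-tree or FULLTEXT index,
//    so every row is examined (EXPLAIN shows type = ALL).
$slow = "SELECT url FROM sites WHERE title LIKE '%chatgpt%' LIMIT 20";

// 2) Trailing wildcard only: a B-tree index on title can be range-scanned.
$better = "SELECT url FROM sites WHERE title LIKE 'chatgpt%' LIMIT 20";

// 3) FULLTEXT search: uses the FULLTEXT index; boolean mode allows a trailing
//    wildcard (chatgpt*), but not a leading one (*chatgpt).
$fulltext = "SELECT url FROM sites
             WHERE MATCH(title, description) AGAINST ('chatgpt*' IN BOOLEAN MODE)
             LIMIT 20";
```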
-
A possible table for shopping.
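A purely hypothetical sketch of what such a table might look like (every name and column is illustrative, not a schema Doogle actually uses):

```php
<?php
// Hypothetical shopping table; all names and columns are illustrative.
$db = new PDO('mysql:host=127.0.0.1;dbname=doogle;charset=utf8mb4', 'user', 'pass');

$db->exec("
    CREATE TABLE IF NOT EXISTS shopping (
        id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        site_id     INT UNSIGNED NOT NULL,           -- reference back to the indexed site
        title       VARCHAR(255) NOT NULL,
        price       DECIMAL(10,2) DEFAULT NULL,
        currency    CHAR(3)       DEFAULT NULL,
        image_url   VARCHAR(512)  DEFAULT NULL,
        product_url VARCHAR(512)  NOT NULL,
        clicks      INT UNSIGNED  NOT NULL DEFAULT 0,
        FULLTEXT KEY ft_title (title)
    ) ENGINE=InnoDB
");
```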
The masonry grid layout would probably work well too. Additionally, a navigation panel on the left or right (taking up 15%-25% of the width) would be useful for the search parameters. The following is generated template code, which may not comply with the conventions used in Doogle's development.
The tabsContainer and mainResultsSection can be modified to have an additional tab. -- As I understand it, the following can be used to optimise database search time.
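A hedged sketch of the usual levers (adding a FULLTEXT index, letting MATCH ... AGAINST do the matching instead of LIKE, and paginating with LIMIT), with illustrative table and column names:

```php
<?php
// Hedged sketch of common MySQL search optimisations; names are illustrative.
$db = new PDO('mysql:host=127.0.0.1;dbname=doogle;charset=utf8mb4', 'user', 'pass');

// 1) Indexing: add a FULLTEXT index once, so text matching no longer scans every row.
$db->exec("ALTER TABLE sites ADD FULLTEXT ft_title_desc (title, description)");

// 2) Query optimisation: use the index via MATCH ... AGAINST instead of LIKE '%term%'.
// 3) Limiting results: fetch only the page being displayed.
$stmt = $db->prepare(
    "SELECT title, url, description
     FROM sites
     WHERE MATCH(title, description) AGAINST (:term IN NATURAL LANGUAGE MODE)
     LIMIT 0, 20"
);
$stmt->execute([':term' => 'chatgpt']);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);
```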
Regarding indexing, query optimisation and limiting search results: I know those are areas for improvement. When developing Doogle I was not particularly aware of their importance. I completely agree with the point about learning to scale yourself. Learning was the purpose behind this repo!
-
I sure wasn't aware of DB importance either; this is the first time I've needed potentially millions of rows processed efficiently. Did you know "Doogle" has already been used as a search engine name and was sued by Google? And I like that; I'm definitely down to implement such a feature, but I am probably going to hold off until I've rectified query times. With MySQL, B-tree indexes for text searches, and full-text indexes used with a leading wildcard, prevent the indexes from being utilized effectively, resulting in a full table scan, which also ignores the limit on the number of search results. I'm just going to have to dedicate some time to make the switch.
Sidenote!! Vector databases. I'm so interested in vector databases; I had forgotten about their existence though. There are other providers for vector databases too, but Pinecone is the only one I have an account with currently. EDIT: Milvus for a self-hosted solution. They can provide significant performance benefits if you have a large amount of unstructured data, but they are also not designed for text search specifically.
Learning is the motivation behind all my projects as well, right alongside the potential to actually have something big that people enjoy using.
-
I understand, but once it's implemented to use a GPT-4 API key, you can always remove the key and allow others to purchase their own. Do something like the sketch below. Connecting a search engine with a generative solution would be quite interesting, as the output will change each time. I never looked into the name origin, as 'Doogle' was originally named 'Google Clone'. From reading the article, the company was operating 'for profit', while this project is free and open-source under an MIT License. However, Google may have copyright (intellectual property) around the logo design and/or naming convention (not sure myself). So, changing the logo design and name, and removing the pagination system's logos at the bottom of the page, could be necessary in the future. Facebook spent some time rectifying this very same issue, where PHP and MySQL became limiting factors; solutions to mitigate query time due to SQL limitations were explored as part of the engineering in 2005/06. I have heard of vector databases, but never researched them. I'll do some research before I reply.
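A hedged guess at the "do something like" idea: read the key from the environment so each user supplies their own, and skip the feature when no key is set. The endpoint and payload follow the public OpenAI chat completions API; everything else is illustrative:

```php
<?php
// Hedged sketch: read the key from the environment so each user supplies their
// own; none of this is Doogle's actual code.
$apiKey = getenv('OPENAI_API_KEY');

if ($apiKey !== false && $apiKey !== '') {
    $payload = json_encode([
        'model'    => 'gpt-4',
        'messages' => [
            ['role' => 'user', 'content' => 'Summarise results for: chatgpt'],
        ],
    ]);

    $ch = curl_init('https://api.openai.com/v1/chat/completions');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'Authorization: Bearer ' . $apiKey,
        ],
        CURLOPT_POSTFIELDS     => $payload,
    ]);
    $gptAnswer = curl_exec($ch);
    curl_close($ch);
} else {
    // No key configured: simply skip the GPT panel rather than failing the search.
    $gptAnswer = null;
}
```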
-
I have not developed the project in a while. But check this out (demo). I think if I were to hop back on the project, I'd use LibreX as a base and add my own search engine to the results to get the best results. It's something to think about.
-
Very cool, I should take some time and see how the torrent and Tor search backends work! Thanks for sharing.
-
Turns out LibreX is dead, but LibreY isn't! I'm starting to get into some motion, hopefully with no hiccups, knock on wood - between work and life, bleh! - but I've been interested in starting up the Doogle project again, and doing so in conjunction with LibreY functionality would be wonderful. But I've got another starting point to chip away at first. And not going to lie, I absolutely love how Doogle handles images. Flawless for what it is, and with millions of entries when I tested it, absolutely no lag. Versatile for sure, and a good amount of headroom to work with! Still, ultimately I need to figure out some MySQL alternative. What a headache though lol
-
I just learned about Tarantool. This may have the same issue all the other NoSQL databases had (not being able to do wildcard searches), but I'm excited enough to share it anyway! And I'll look into it further. This could be completely inapplicable to the project, but it's still really cool. (Very little testing on my end; I just ensured it was able to be installed and it worked. I will implement it in one of my projects later, where the real test will come.) Tarantool is an in-memory computing platform that combines the capabilities of a high-performance NoSQL database with a flexible application server, offering a unique approach to handling data-intensive applications. It's designed to process requests at the speed of RAM. Tarantool supports a variety of data models, including key-value, document, and even relational models.
Tarantool supports persistent storage alongside its primary in-memory capabilities: write-ahead logging and snapshots for the in-memory (memtx) engine, plus a disk-based (vinyl) storage engine.
Might I specify here: you can choose to store the database entirely on disk, giving it only a certain amount of RAM for cache/buffering operations and, I believe, indexes too.
Cons: there's a learning curve associated with adopting Tarantool's data models and Lua scripting.
Sidenote: sorry for not really being on this repo/project - sidetracked like a mf, it's one project to the next. But I have been messing with large datasets and large quantities of network requests as of late, so when I do come back to it, it should be quite a fun time. The only unfortunate aspect is... I kind of misplaced the Doogle source code I had. I think I have it, but I don't know if I actually have the latest version of it? If not the latest, it looks like I have an updated-enough version to be able to work from. Plus, I've been learning a bit as of late anyway, so there's a fine chance I wouldn't even need it. I was an extreme fan of my crawling implementation. It crawled and collected more data than I could handle; I had 80+ second search queries. I also didn't optimize what I crawled whatsoever, among other things. Still, fun times. This still is an overarching goal.
-
A small note that Doogle has been moved from 'safesploit' to 'safesploitOrg', so as to avoid any confusion.
-
Original Issue title: Memory Leak - calling itself inside of itself, forever, with a solution
Hi!
I've learned something and I've found a solution
Crawler.php
Line 202
foreach($crawling as $site)
My PHP error logs went haywire after running the script for an extended period of time: 971 occurrences of the function calling itself, and it causes a crash at that point.
Stack trace:
#2 Crawler.php(494): Crawler->followLinks()
#3 Crawler.php(494): Crawler->followLinks()
#4 Crawler.php(494): Crawler->followLinks()
#5 Crawler.php(494): Crawler->followLinks()
...
and it goes on and on, up to 971 occurrences.
The issue is that what the script has been doing is:
1. followLinks() gets called for URL1 (and doesn't call getDetails for it?)
2. It views the first a href on URL1.
3. It visits that a href URL (let's call it URL2) from URL1 and calls getDetails on it.
4. It visits the first a href on URL2 and calls getDetails.
5. It visits the first a href on the last a href on URL2 from URL1.
6. It visits the first a href on the last a href on the last a href on URL2 from URL1.
7. And so on, forever, UNTIL one subprocess doesn't have any a hrefs or all the a hrefs were processed; only then does it go back up to its parent node.
Meaning
The first original array's worth of a href URLs is not fully processed until the sub-processes finish, and for the sub-processes to finish, the sub-sub-processes have to finish, and eventually you get to the point where this happens:
Allowed memory size of 16582912000 bytes exhausted (tried to allocate 20480 bytes)
Solution!
Line 166
function followLinks($url)
to
function followLinks($url, $depth = 0)
Line 170: insert
if ($depth >= 12) {return;}
Replace 12 with how deep you want it to go. Line 203: erase and fill with
if (isset($site)) { $this->followLinks($site, $depth + 1); }
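Putting the pieces together, a hedged sketch of how the depth-limited followLinks() might look; getLinksFromPage() is a placeholder since Crawler.php's link extraction isn't shown here, and only the depth guard and recursive call reflect the actual fix:

```php
// Hedged sketch only: getLinksFromPage() is a placeholder for however
// Crawler.php actually extracts the a href URLs.
public function followLinks($url, $depth = 0)
{
    // Stop descending once the recursion is 12 levels deep.
    if ($depth >= 12) {
        return;
    }

    $crawling = $this->getLinksFromPage($url); // placeholder for the existing link extraction

    foreach ($crawling as $site) {
        $this->getDetails($site);

        if (isset($site)) {
            $this->followLinks($site, $depth + 1);
        }
    }
}
```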
Alternative Solution
public function count_method_occurrences($method_name)
{
    // Walk the current call stack and count frames belonging to the given method.
    $backtrace = debug_backtrace();
    $count = 0;
    foreach ($backtrace as $trace) {
        if (isset($trace['class']) && isset($trace['function']) && $trace['function'] === $method_name) {
            $count++;
        }
    }
    return $count;
}
Call it via
if ($this->count_method_occurrences('followLinks') < 12) { foreach ($crawling as $site) { /* ... existing loop body ... */ } }
*Please note that the occurrence count from this will be one higher than $depth.