Replies: 20 comments 5 replies
-
@dehlirious Regarding your earlier pull request, I have some concerns, which I commented on in the request; I'm just awaiting feedback on those. Otherwise, I'm happy to merge 😄 -- Below I have reformatted the code for GitHub readability and renamed functions and variables from snake_case to camelCase.
Alternative Solution
Call
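A hedged reconstruction of the reformatted snippet, assuming it simply restyles the count_method_occurrences() helper and its call from the original issue (further down) into camelCase:

```php
// Hedged reconstruction: assumes the restyling only renames the
// count_method_occurrences() helper and its call to camelCase.
public function countMethodOccurrences($methodName)
{
    $backtrace = debug_backtrace();
    $count = 0;

    foreach ($backtrace as $trace) {
        // Count how many frames on the current call stack belong to this method.
        if (isset($trace['class'], $trace['function']) && $trace['function'] === $methodName) {
            $count++;
        }
    }

    return $count;
}

// Call
if ($this->countMethodOccurrences('followLinks') < 12) {
    // ... existing foreach over the crawled links ...
}
```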
-
Regarding that pull request: ignore it, I didn't realize it included several unrelated commits. I'm a fan of the restyling/naming. The only thing I'll mention is that I think the function could still be useful in future, but I'm just using this version myself. Also, I can't seem to get my code to appear formatted correctly on my phone.
On a side note, I have made a lot of changes, and I'll be sure to try and share the useful/beneficial ones before I (attempt to) rewrite it and use Solr instead of MySQL.
-
Check your last comment, which I edited: you need to use triple backticks (on separate lines) to enclose the code. See Creating and highlighting code blocks. Completely understand; I am in the process of rewriting an old project I wrote from snake_case to camelCase. A consistent convention is all that matters. Sure, I appreciate you pushing updates. If you have a public branch on your fork, I can always check changes and new features there. I presume you're using Apache Solr for full-text search, in which case I might suggest Redis. The reason I suggest Redis is that, besides being a multi-threaded cache with full-text search, it also has the benefit of being a database which supports replicas and edge-case scenarios (such as those offered by Kubernetes).
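For reference, Redis's full-text search comes from the RediSearch module (FT.CREATE / FT.SEARCH). A minimal sketch of indexing and querying from PHP, assuming the phpredis extension and a RediSearch-enabled server; the index name, key prefix and field names are illustrative only:

```php
<?php
// Minimal sketch, assuming phpredis and a Redis server with the RediSearch
// module loaded. Index, prefix and field names are illustrative only.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// One-off setup: create a full-text index over hashes whose keys start with "site:".
$redis->rawCommand(
    'FT.CREATE', 'idx:sites',
    'ON', 'HASH', 'PREFIX', '1', 'site:',
    'SCHEMA', 'title', 'TEXT', 'description', 'TEXT'
);

// Store a document as a plain hash; RediSearch indexes it automatically.
$redis->hMSet('site:1', [
    'title'       => 'ChatGPT explained',
    'description' => 'An overview of large language models',
]);

// Full-text query returning the first 20 matches.
$results = $redis->rawCommand('FT.SEARCH', 'idx:sites', 'chatgpt', 'LIMIT', '0', '20');
var_dump($results);
```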
-
Just released the changes. Check them out and let me know what you think. EDIT: check it out live at https://zrr.us/doogle/search.php?term=chatgpt&type=sites
"I presume you're using Apache Solr for full-text search"
-
From what I've had a chance to look at so far, I like the change. I couldn't help but notice that in crawl-manual.php you've placed a couple of Udemy courses. Would I be correct to assume you want to integrate ChatGPT and implement NLP? At scale, I can begin to see the performance limitations of using PHP and MySQL. A possible solution for the server-side language without a complete rewrite could be WebAssembly (WASM), as PHP is compatible. Alternatively, C++ or Go if a rewrite is acceptable, as either compiles into machine code. Do you intend to make your Python backend, or the entire source code, open-source?
-
For the URLs, I quite simply asked ChatGPT to generate me 50-100 URLs based on a few categories, and AI/ChatGPT was one of them. I haven't tried to use ChatGPT for processing the URLs' data and phrases; I think it'd be way too expensive for me right now, but I was planning on having a ChatGPT response alongside all the results. I am currently using an NLP library for processing the URLs, although at the moment I'm not using it to its fullest capacity.
Also, I stopped processing the unprocessed "sentences" column and reduced load time from 40s to 4-8s, and with Redis caching implemented, to a 200ms load time after the initial search. However, if you look up a basic word like "a" with more than 200-300k results, it is still quite slow on the initial search. Even though the MySQL query says "limit 0-20", I feel MySQL is still looking through the entirety of the database even after it's found its 20 matches. Unsure if that is the case, but performance-wise that is what it feels like. I'm going to go through a few different options; my next test is going to be with Solr, and I'll see the performance difference.
And I could release the Python file, but I don't know if it's done. Sometimes I'll write code and find out there was a more efficient way all along. Maybe I don't need the Python spaCy NLP library and I can do it in PHP, or use a less advanced PHP NLP library and achieve the same or similar result. With the current setup, I was able to process ~250k URLs an hour by the time I fine-tuned it. Currently I'm just rewriting more base stuff; right now I've made every submit button fetch the URL via Ajax, cleaning up my solutions and removing unnecessary code.
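As a side note on the Redis caching mentioned above, a rough sketch of the cache-aside pattern that gives those ~200ms repeat searches, assuming phpredis and PDO; the table, column and function names are illustrative, not Doogle's actual code:

```php
<?php
// Hedged sketch of cache-aside search caching, assuming phpredis and PDO.
// Table, column and function names are illustrative, and MATCH ... AGAINST
// assumes a FULLTEXT index on (title, description).
function searchSites(PDO $db, Redis $redis, string $term, int $page = 0): array
{
    $cacheKey = 'search:' . md5($term . ':' . $page);

    // Serve repeat searches straight from Redis.
    $cached = $redis->get($cacheKey);
    if ($cached !== false) {
        return json_decode($cached, true);
    }

    // First search of this term/page falls through to MySQL.
    $offset = $page * 20;
    $stmt = $db->prepare(
        "SELECT title, url, description FROM sites
         WHERE MATCH(title, description) AGAINST (:term)
         LIMIT $offset, 20"
    );
    $stmt->execute([':term' => $term]);
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // Cache for 10 minutes so identical searches skip MySQL entirely.
    $redis->setex($cacheKey, 600, json_encode($results));

    return $results;
}
```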
-
I see, that explains it then. I completely understand the need for test data. I personally used my website, as it has hyperlinks to other websites like GitHub, LinkedIn, StackOverflow and Twitter. Bleeping Computer and BBC were other good resources for indexing. But please keep in mind that your bot could be treated as a DoS and banned. I personally am hosting this on a LAN and a VPS so that I won't incur additional fees. Are you using a cloud provider like AWS, Azure or GCP, where you pay for the resources you use?
I don't think I understand; can you explain 'I stopped processing the unprocessed "sentences" column'? From the MySQL documentation, the LIMIT query optimisation is explained to prefer a 'full table scan' (in my opinion this wording is ambiguous for documentation). However, the documentation does seem to confirm your suspicion 🤔 (a quick EXPLAIN check is sketched at the end of this comment).
Remember that the full-text search feature Apache Solr offers is also offered by Redis. Speaking of Redis and caching servers, Memcached is an option if you were unaware. But again, since Redis 6 added multi-threading support and Redis has native replica support, there is little reason to have such a complex technology stack.
I have done that enough myself, feeling a solution wasn't ready. From my experience, just push to GitHub (even if it isn't properly working) and go from there. Otherwise, the solution never feels ready for the public's eyes. My exception to this was when doing malware development, which I was very cautious about before making any of the repos public. But the community generally accepts that everything is always a 'work in progress' and tends not to criticise harshly.
Strictly speaking, the search engine and web crawler portions are separate, so it would be completely acceptable if you programmed each using a different language. Take for instance the modern web browser setup: the client side uses JavaScript while the server side can use Node (JavaScript), PHP, Python, etc. Hence, I would argue mixing technology stacks isn't necessarily a bad thing; just consider the complexity of maintaining it and how others may perceive it (because it's open-source). I am questioning whether Doogle's backend would benefit from a migration to another language and database, though 🤔 I'm just not sure if I have the time to undertake such a project at the moment!
That Ajax solution will be a much-needed improvement to usability.
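One way to test the LIMIT suspicion above rather than guess is to ask MySQL for its query plan; a small sketch, with an illustrative table name:

```php
<?php
// Hedged sketch: use EXPLAIN to see whether MySQL expects to scan the whole
// table despite the LIMIT clause. Table and column names are illustrative.
$db = new PDO('mysql:host=127.0.0.1;dbname=doogle;charset=utf8mb4', 'user', 'pass');

$stmt = $db->query(
    "EXPLAIN SELECT title, url
     FROM sites
     WHERE description LIKE '%chatgpt%'
     LIMIT 0, 20"
);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // type = ALL plus a large 'rows' estimate means a full table scan;
    // the LIMIT only trims the rows returned, not the rows examined.
    echo $row['type'] . ' / rows examined (estimate): ' . $row['rows'] . PHP_EOL;
}
```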
-
I currently have a dedicated server that I use for all projects. Sorry if what I said was lacking context: "sentences" is just the indexed URL's plaintext. I was seeing what the difference between the descriptions column and the sentences column was during initial testing; when it came to results, sentences isn't a good source and resulted in a lot of false positives, so I stopped pulling results from the sentences column, and it reduced load time significantly. And I looked into the MySQL thing when I first had the suspicion and got the same answer as you. Seems it's just the way it works. I'll think about that. Even if it was just a column or two stored in Redis/Memcached, it could speed up the MySQL query for sure. You're definitely right, there's no need for a complex stack, but I'm willing to test things out. Ultimately, I have to write the code for any implementation, and as I work through that I'll get a better understanding of what exactly I need/want and what I can do. For example: "... Otherwise, the solution never feels ready for the public's eyes". Personally, I'm not moving away from PHP unless there's a critical reason to. Maybe to HHVM, though. I'll have to reach out on Twitter sometime.
-
Afraid that's a coincidence. Web crawlers are known to create lots of traffic from a single IP address, which in the context I was speaking about can result in some type of ban. IP address bans are the easiest, but often the most problematic due to dynamic IP addresses. Is your dedicated server using separate environments for each project, such as virtualisation (QEMU or VirtualBox) or containerisation (Docker/Podman)?
I see, I completely understand that choice. When developing I was just using as much data as possible, because I was learning how web crawlers worked. But now I realise the most useful data was the domain names, titles and keywords for site searches. Naturally, image searches are another story, and being happy with that solution I never felt the need to implement shopping or video searching.
Quite unfortunate about that MySQL limitation, and because of how different each SQL implementation is, using the documentation for Microsoft SQL, PostgreSQL or SQLite would not be viable! I figured as much. Apache Solr has quite a tedious syntax to write. But I'm not necessarily saying Redis is any easier; I will need to look into that solution further myself.
Twitter is a good place for consultation. If you message me on Twitter and don't get a reply, it's more likely I haven't seen your message.
-
Well, pre-8pm EST the website was still flagged as "deceptive", but I just checked and it's gone! No longer labelled as deceptive or on a phishing list. Nothing.
I'm confused about that, because I went to the same URL and it is just revealing the IP used to make pocketproxy requests, not my personal IP. As far as I'm aware, I don't pass HTTP_X_FORWARDED_FOR (using Firefox and Chrome). I use Docker the most, but not always, which maybe is a habit that could be amended. I don't currently have a need for virtualization like VirtualBox at my scale; if I had to scale up, absolutely something of that nature. If you're asking whether it's going through another proxy, no. I absolutely enjoy the image search functionality; it works quite well, and I haven't made a single change. Adding shopping sounds like it could be an ache overall but well worthwhile once the time for it becomes available. Solr is fun/tedious. It is doable though! I've got it mostly working already. But I'm just going to figure out a way to make the MySQL version better for now, anyway. Sounds good!
-
I see zrr.us is now listed as safe, as of 5pm GMT, so congrats on the successful appeal! HTTP_X_FORWARDED_FOR is being requested here. If you're that involved with Docker, I would suggest looking at Podman as a more secure alternative. But I am still relatively new to the containerisation scene, so it is just a suggestion. I am still a bit old school and rely heavily on virtualisation, but I am gradually transitioning where appropriate. The masonry grid layout is nice, but the performance of the framework isn't great on smartphones. Otherwise, it was a good design choice, as it saved a lot of development time. When I have some time I will send you some relational database suggestions for improving query time. But Amazon Aurora is a good place to look at architecture choices, as it claims to be 7x faster than MySQL server.
-
Initially, I thought so too! However, upon closer inspection, the getUserIp() function is utilized solely within the logcbl() function, which is a logging method employed exclusively in one snippet around line 510.
I'm open to giving Podman a try; it isn't the first time I've heard of it. And I believe virtualization is the right approach, so keep to it when possible, I'd suggest. Good to know about the framework. And you're right, it should be relatively simple to implement. I'll probably try multiple options, so I'm happy to hear more suggestions. As it turns out, using a leading wildcard in a MySQL search query disregards all indexes (including full-text ones) and results in a full table scan.
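A small illustration of that point, with made-up table and column names; the leading-wildcard LIKE forces a scan, while a trailing wildcard or a FULLTEXT MATCH can use an index:

```php
<?php
// Hedged illustration of why a leading wildcard defeats indexes.
// Table and column names are made up for the example.

// 1) Leading wildcard: MySQL cannot use a B-tree or FULLTEXT index,
//    so every row is examined (EXPLAIN shows type = ALL).
$slow = "SELECT url FROM sites WHERE title LIKE '%chatgpt%' LIMIT 20";

// 2) Trailing wildcard only: a B-tree index on title can be range-scanned.
$better = "SELECT url FROM sites WHERE title LIKE 'chatgpt%' LIMIT 20";

// 3) FULLTEXT search: uses the FULLTEXT index; boolean mode allows a trailing
//    wildcard (chatgpt*), but not a leading one (*chatgpt).
$fulltext = "SELECT url FROM sites
             WHERE MATCH(title, description) AGAINST ('chatgpt*' IN BOOLEAN MODE)
             LIMIT 20";
```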
-
A possible table for shopping.
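A purely hypothetical sketch of what such a table might look like (every name and column is illustrative, not a schema Doogle actually uses):

```php
<?php
// Hypothetical shopping table; all names and columns are illustrative.
$db = new PDO('mysql:host=127.0.0.1;dbname=doogle;charset=utf8mb4', 'user', 'pass');

$db->exec("
    CREATE TABLE IF NOT EXISTS shopping (
        id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        site_id     INT UNSIGNED NOT NULL,           -- reference back to the indexed site
        title       VARCHAR(255) NOT NULL,
        price       DECIMAL(10,2) DEFAULT NULL,
        currency    CHAR(3)       DEFAULT NULL,
        image_url   VARCHAR(512)  DEFAULT NULL,
        product_url VARCHAR(512)  NOT NULL,
        clicks      INT UNSIGNED  NOT NULL DEFAULT 0,
        FULLTEXT KEY ft_title (title)
    ) ENGINE=InnoDB
");
```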
The masonry grid layout would probably work well too. Additionally, a navigation panel on the left or right (taking up 15%-25% of the width) would be useful for the search parameters. The following is generated template code, which may not comply with the conventions used in Doogle's development.
The tabsContainer and mainResultsSection can be modified to have an additional tab. -- As I understand it, the following can be used to optimise database search time.
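A hedged sketch of the usual levers (adding a FULLTEXT index, letting MATCH ... AGAINST do the matching instead of LIKE, and paginating with LIMIT), with illustrative table and column names:

```php
<?php
// Hedged sketch of common MySQL search optimisations; names are illustrative.
$db = new PDO('mysql:host=127.0.0.1;dbname=doogle;charset=utf8mb4', 'user', 'pass');

// 1) Indexing: add a FULLTEXT index once, so text matching no longer scans every row.
$db->exec("ALTER TABLE sites ADD FULLTEXT ft_title_desc (title, description)");

// 2) Query optimisation: use the index via MATCH ... AGAINST instead of LIKE '%term%'.
// 3) Limiting results: fetch only the page being displayed.
$stmt = $db->prepare(
    "SELECT title, url, description
     FROM sites
     WHERE MATCH(title, description) AGAINST (:term IN NATURAL LANGUAGE MODE)
     LIMIT 0, 20"
);
$stmt->execute([':term' => 'chatgpt']);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);
```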
Regarding indexing, query optimisation and limiting search results: I know those are areas for improvement. When developing Doogle I was not particularly aware of their importance. I completely agree with the point about learning to scale yourself. Learning was the purpose behind this repo!
-
I sure wasn't aware of DB importance either; this is the first time I've needed potentially millions of rows processed efficiently. Did you know "Doogle" has already been used as a search engine name and was sued by Google? And I like that; I'm definitely down to implement such a feature, but I am probably going to hold off until I've rectified query times. With MySQL, B-tree indexes for text searches, and full-text indexes used with a leading wildcard, prevent the indexes from being utilized effectively, resulting in a full table scan, which also ignores the limit on the number of search results. I'm just going to have to dedicate some time to make the switch.
Sidenote!! Vector databases. I'm so interested in vector databases; I had forgotten about their existence though. There are other providers for vector databases too, but Pinecone is the only one I have an account with currently. EDIT: Milvus for a self-hosted solution. They can provide significant performance benefits if you have a large amount of unstructured data, but they are also not designed for text search specifically.
Learning is the motivation behind all my projects as well, right alongside the potential to actually have something big that people enjoy using.
-
I understand, but once it's implemented to use a GPT-4 API key, you can always remove the key and allow others to purchase their own. Do something like the sketch below. Connecting a search engine with a generative solution would be quite interesting, as the output will change each time. I never looked into the name origin, as 'Doogle' was originally named 'Google Clone'. From reading the article, the company was operating 'for profit', while this project is free and open-source under an MIT License. However, Google may have copyright (intellectual property) around the logo design and/or naming convention (not sure myself). So, changing the logo design and name, and removing the pagination system's logos at the bottom of the page, could be necessary in the future. Facebook spent some time rectifying this very same issue, where PHP and MySQL became limiting factors; solutions to mitigate query time due to SQL limitations were explored as part of the engineering in 2005/06. I have heard of vector databases, but never researched them. I'll do some research before I reply.
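A hedged guess at the "do something like" idea: read the key from the environment so each user supplies their own, and skip the feature when no key is set. The endpoint and payload follow the public OpenAI chat completions API; everything else is illustrative:

```php
<?php
// Hedged sketch: read the key from the environment so each user supplies their
// own; none of this is Doogle's actual code.
$apiKey = getenv('OPENAI_API_KEY');

if ($apiKey !== false && $apiKey !== '') {
    $payload = json_encode([
        'model'    => 'gpt-4',
        'messages' => [
            ['role' => 'user', 'content' => 'Summarise results for: chatgpt'],
        ],
    ]);

    $ch = curl_init('https://api.openai.com/v1/chat/completions');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'Authorization: Bearer ' . $apiKey,
        ],
        CURLOPT_POSTFIELDS     => $payload,
    ]);
    $gptAnswer = curl_exec($ch);
    curl_close($ch);
} else {
    // No key configured: simply skip the GPT panel rather than failing the search.
    $gptAnswer = null;
}
```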
-
I have not developed the project in a while. But check this out (demo). I think if I were to hop back on the project, I'd use LibreX as a base and add my own search engine to the results to get the best results. It's something to think about.
-
Very cool, I should take some time and see how the torrent and Tor search backends work! Thanks for sharing.
-
Turns out LibreX is dead, but LibreY isn't! I'm starting to get into some motion, hopefully with no hiccups, knock on wood - between work and life, bleh! - but I've been interested in starting up the Doogle project again, and doing so in conjunction with LibreY functionality would be wonderful. But I've got another starting point to chip away at first. And not going to lie, I absolutely love how Doogle handles images. Flawless for what it is, and with millions of entries when I tested it, absolutely no lag. Versatile for sure, and a good amount of headroom to work with! Still, ultimately I need to figure out some MySQL alternative. What a headache though lol
-
I just learned about Tarantool. This may have the same issue all the other NoSQL databases had (not being able to do wildcard searches), but I'm excited enough to share it anyway! And I'll look into it further. This could be completely inapplicable to the project, but it's still really cool. (Very little testing on my end; I just ensured it was able to be installed and it worked. I will implement it in one of my projects later, where the real test will come.) Tarantool is an in-memory computing platform that combines the capabilities of a high-performance NoSQL database with a flexible application server, offering a unique approach to handling data-intensive applications. It's designed to process requests at the speed of RAM. Tarantool supports a variety of data models, including key-value, document, and even relational models.
Tarantool supports persistent storage alongside its primary in-memory capabilities: write-ahead logging and snapshots for the in-memory (memtx) engine, plus a disk-based (vinyl) storage engine.
Might I specify here: you can choose to store the database entirely on disk, giving it only a certain amount of RAM for cache/buffering operations and, I believe, indexes too.
Cons: there's a learning curve associated with adopting Tarantool's data models and Lua scripting.
Sidenote: sorry for not really being on this repo/project - sidetracked like a mf, it's one project to the next. But I have been messing with large datasets and large quantities of network requests as of late, so when I do come back to it, it should be quite a fun time. The only unfortunate aspect is... I kind of misplaced the Doogle source code I had. I think I have it, but I don't know if I actually have the latest version of it? If not the latest, it looks like I have an updated-enough version to be able to work from. Plus, I've been learning a bit as of late anyway, so there's a fine chance I wouldn't even need it. I was an extreme fan of my crawling implementation. It crawled and collected more data than I could handle; I had 80+ second search queries. I also didn't optimize what I crawled whatsoever, among other things. Still, fun times. This still is an overarching goal.
-
A small note that Doogle has been moved from 'safesploit' to 'safesploitOrg', so as to avoid any confusion.
-
Original Issue title: Memory Leak - calling itself inside of itself, forever, with a solution
Hi!
I've learned something and I've found a solution
Crawler.php
Line 202
foreach($crawling as $site)
My PHP error logs went haywire after running the script for an extended period of time: 971 occurrences of the function calling itself, and it causes a crash at that point.
Stack trace:
#2 Crawler.php(494): Crawler->followLinks()
#3 Crawler.php(494): Crawler->followLinks()
#4 Crawler.php(494): Crawler->followLinks()
#5 Crawler.php(494): Crawler->followLinks()
...
and it goes on and on, up to 971 occurrences.
The issue is that what the script has been doing is:
1. followLinks() gets called for URL1 (and doesn't call getDetails for it?)
2. It views the first a href on URL1.
3. It visits that a href URL (let's call it URL2) from URL1 and calls getDetails on it.
4. It visits the first a href on URL2 and calls getDetails.
5. It visits the first a href on the last a href on URL2 from URL1.
6. It visits the first a href on the last a href on the last a href on URL2 from URL1.
7. And so on, forever, UNTIL one subprocess doesn't have any a hrefs or all the a hrefs were processed; only then does it go back up to its parent node.
Meaning
The first original array's worth of a href URLs is not fully processed until the sub-processes finish, and for the sub-processes to finish, the sub-sub-processes have to finish, and eventually you get to the point where this happens:
Allowed memory size of 16582912000 bytes exhausted (tried to allocate 20480 bytes)
Solution!
Line 166
function followLinks($url)
to
function followLinks($url, $depth = 0)
Line 170: insert
if ($depth >= 12) {return;}
Replace 12 with how deep you want it to go. Line 203: erase and fill with
if (isset($site)) { $this->followLinks($site, $depth + 1); }
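Putting the pieces together, a hedged sketch of how the depth-limited followLinks() might look; getLinksFromPage() is a placeholder since Crawler.php's link extraction isn't shown here, and only the depth guard and recursive call reflect the actual fix:

```php
// Hedged sketch only: getLinksFromPage() is a placeholder for however
// Crawler.php actually extracts the a href URLs.
public function followLinks($url, $depth = 0)
{
    // Stop descending once the recursion is 12 levels deep.
    if ($depth >= 12) {
        return;
    }

    $crawling = $this->getLinksFromPage($url); // placeholder for the existing link extraction

    foreach ($crawling as $site) {
        $this->getDetails($site);

        if (isset($site)) {
            $this->followLinks($site, $depth + 1);
        }
    }
}
```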
Alternative Solution
public function count_method_occurrences($method_name)
{
    // Walk the current call stack and count frames belonging to the given method.
    $backtrace = debug_backtrace();
    $count = 0;
    foreach ($backtrace as $trace) {
        if (isset($trace['class']) && isset($trace['function']) && $trace['function'] === $method_name) {
            $count++;
        }
    }
    return $count;
}
Call it via
if ($this->count_method_occurrences('followLinks') < 12) { foreach ($crawling as $site) { /* ... existing loop body ... */ } }
*Please note that the occurrence count from this will be one higher than $depth.