(I paused reading at the "Automating the Search" section to leave this quick comment:)
Looks interesting, with what I'm sure are many clever data dump ideas, but I lost some interest because the simplest, stupidest solution, splitting the search into "AIza-", "AIza0", "AIza1", "AIza2", ... "AIzaq" ... "AIzaZ", was not mentioned.
If necessary, go two (or even three) characters deep, as in:
"AIza--" "AIza-0" "AIza-1" ... "AIza-Z"
"AIza0-" "AIza00" "AIza01" ... "AIza0Z"
...
"AIzaZ-" "AIzaZ0" "AIzaZ1" ... "AIzaZZ"
No matter how clever you are, the stupidest (simple) solutions outperform the "smart" (complicated) ones.
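For what it's worth, the prefix split described above could be sketched in Python roughly like this. It assumes the alphabet is just the hyphen, digits, and ASCII letters (the characters listed in the post); `ALPHABET` and `search_prefixes` are made-up names for illustration, not anything from the article.

```python
import string
from itertools import product

# Assumed alphabet: hyphen, digits, lowercase, uppercase (as listed in the post).
ALPHABET = "-" + string.digits + string.ascii_lowercase + string.ascii_uppercase

def search_prefixes(depth=1, base="AIza"):
    """Return every search prefix formed by appending `depth` characters to `base`."""
    return [base + "".join(chars) for chars in product(ALPHABET, repeat=depth)]

one_deep = search_prefixes(1)   # "AIza-", "AIza0", ..., "AIzaZ"
two_deep = search_prefixes(2)   # "AIza--", "AIza-0", ..., "AIzaZZ"
```

Each prefix becomes one search query, so going one character deeper multiplies the number of queries by the alphabet size. Only go deeper for prefixes whose results are still truncated.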
Forum rules
The off topic has no rules :)
- 3ICE
- Admin
- Posts: 2631
- Joined: Sat Mar 01, 2008 11:34 pm
- Realm: Europe
- Account: 3ICE
- Clan: 3ICE
- Location: Hungary
Re: No matter how clever you are, the stupidest (simple) solutions outperform the "smart" (complicated) ones.
(Yes, I am still reading the article - very informative! But...)
And then you just had to go and parse the whole HTML source code (tokenizing it, etc.) instead of simply lifting every
/AIza[a-zA-Z0-9\-]{35}/
string with regex... Waste of processing power.
Edit: So you DID use regex in the end. And an even better one than mine (simpler):
/AIza.{35}/
Should have started with that, instead of constructing a tag hierarchy from HTML code, filtering it, etc. Tokenizing HTML is far more expensive and uses orders of magnitude more regex queries than a single search over plaintext would. Of course C code working with raw string comparisons would... but no, let's not go there.
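To show what I mean by a single search over plaintext, here is a minimal Python sketch. It uses my stricter pattern from above, so the character class is an assumption and would need widening if keys can contain other characters (underscores, say). `KEY_RE` and `extract_keys` are made-up names, not the article's code.

```python
import re

# Pattern copied from this post: "AIza" followed by 35 letters, digits, or hyphens.
KEY_RE = re.compile(r"AIza[a-zA-Z0-9\-]{35}")

def extract_keys(text):
    """Return the unique key-like strings in `text`, in order of first appearance."""
    # dict.fromkeys preserves insertion order, so it deduplicates without reordering.
    return list(dict.fromkeys(KEY_RE.findall(text)))

# Usage with an obviously fake key embedded in raw HTML; no parsing or tokenizing.
fake_key = "AIza" + "A" * 35
page = f'<a href="https://example.com/?key={fake_key}">link</a>'
found = extract_keys(page + page)
```

The whole page is treated as one string, so the tag hierarchy never needs to be built.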
Edit: Finished reading, thank you for the write-up. I stopped picking apart the second half of the article because there was nothing to complain about. Harvesting, deduplication, search result unpagination, etc. are all top notch. :)