
No matter how clever you are, the stupidest (simple) solutions outperform the "smart" (complicated) ones.

Posted: Fri Jul 20, 2018 4:36 pm
by 3ICE
(I paused reading at the "Automating the Search" section to leave this quick comment:)
Looks interesting, with what I'm sure are many clever data dump ideas, but I lost some interest because the simplest, stupidest solution was not mentioned: splitting the search into "AIza-", "AIza0", "AIza1", "AIza2", ... "AIzaq" ... "AIzaZ".
If necessary, go two (or even three) characters deep, as in:
"AIza--" "AIza-0" "AIza-1" ... "AIza-Z"
"AIza0-" "AIza00" "AIza01" ... "AIza0Z"
...
"AIzaZ-" "AIzaZ0" "AIzaZ1" ... "AIzaZZ"
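
For anyone who wants to try it, here is a minimal sketch of generating those prefixes. It assumes the alphabet is exactly what the lists above show (hyphen, digits, lowercase, uppercase); the names `ALPHABET` and `prefix_queries` are made up for illustration, and nothing here talks to a real search engine.

```python
import itertools
import string

# Alphabet as listed above: "-", then 0-9, a-z, A-Z (63 symbols in total).
ALPHABET = "-" + string.digits + string.ascii_lowercase + string.ascii_uppercase


def prefix_queries(depth=1, base="AIza"):
    """Yield every search prefix made of `base` plus `depth` alphabet characters."""
    for combo in itertools.product(ALPHABET, repeat=depth):
        yield base + "".join(combo)


one_deep = list(prefix_queries(1))
two_deep = list(prefix_queries(2))

print(len(one_deep), one_deep[0], one_deep[-1])  # 63 AIza- AIzaZ
print(len(two_deep), two_deep[0], two_deep[-1])  # 3969 AIza-- AIzaZZ
```

Each extra character multiplies the number of queries by 63, so three deep is already 250,047 searches. Going deeper only makes sense for prefixes whose result list is still cut off.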

No matter how clever you are, the stupidest (simple) solutions outperform the "smart" (complicated) ones.

Re: No matter how clever you are, the stupidest (simple) solutions outperform the "smart" (complicated) ones.

Posted: Fri Jul 20, 2018 4:45 pm
by 3ICE
(Yes, I am still reading the article - very informative! But...)

And then you just had to go and parse the whole HTML source code (tokenizing it, etc.) instead of simply lifting every
/AIza[a-zA-Z0-9\-]{35}/
string with regex... Waste of processing power.

Edit: So you DID use regex in the end. And an even better (simpler) one than mine:
/AIza.{35}/
Should have started with that, instead of constructing a tag hierarchy from HTML code, filtering it, etc. Tokenizing HTML is far more expensive and uses orders of magnitude more regex queries than a single search over plaintext would. Of course C code working with raw string comparisons would... but no, let's not go there.
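
A rough sketch of what I mean by a single search over plaintext, using the two patterns quoted above. The helper name `extract_keys`, the sample HTML, and the placeholder key are all invented for illustration; the placeholder is just "AIza" plus 35 filler characters, not a real key.

```python
import re

# Pattern from this post: "AIza" followed by 35 letters, digits, or hyphens.
STRICT = re.compile(r"AIza[a-zA-Z0-9\-]{35}")
# Simpler pattern from the article: "AIza" followed by any 35 characters.
LOOSE = re.compile(r"AIza.{35}")


def extract_keys(text, pattern=STRICT):
    """Return the unique matches in `text`, in order of first appearance."""
    return list(dict.fromkeys(pattern.findall(text)))


# Hypothetical page source containing the same key-shaped string twice.
fake = "AIza" + "x" * 35
html = f'<script src="https://example.invalid/js?key={fake}"></script><p>{fake}</p>'

print(extract_keys(html) == [fake])         # True
print(extract_keys(html, LOOSE) == [fake])  # True
```

The loose pattern also accepts characters the strict class leaves out, at the cost of happily matching any 35 characters that follow "AIza", key or not. Either way it is one pass over the raw text with no tag tree involved.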

Edit: Finished reading, thank you for the write-up. I stopped picking apart the second half of the article because there was nothing to complain about. Harvesting, deduplication, search result unpagination, etc. are all top-notch. :)