Wednesday, July 15, 2015

If I HATE web lists so much - why do i click on them?!

I hate web lists.  I really do.  Especially the ones that list things ONE-AT-TIME so you have to load a page that is 90% ads. Irritating!  But, I saw one about "America's Top 100 Colleges". I work at a University - I wondered if we made the top 100.  So, I took the bait.

First I am sent to a page telling me what the list is about WITH LOTS OF ADS and a NEXT button.  I click NEXT. Ok, number 1, lots of ads another NEXT button.  This has already gotten old for me.  I was just curious what the top 100 colleges were - can't they just show me a list?!

So, I think to myself "I wonder if I could just harvest the list from the website with Perl.  I look at the URL - I see what they're doing - simply incrementing the folder by 1 each time.  Easy!  Next I look at the source code: TOO EASY!  They have only the STRONG tag once per page for the school name.  Cool!  So, two minutes of code and then: Tah-dah!


#!/usr/bin/perl -w
# Remember - this took two minutes, I know it's not perfect

use LWP::Simple;

my $rank=100;
my $base="http://www3.forbes.com/forbeswoman/americas-top-100-colleges/";
while ($rank >0) {
    my $url = $base . "/$rank/";
    my $html = get($url);
    $html =~ m{<p><strong>(.*)</strong></p>}
          } or die "Cannot find school: $rank\n";
    $school = $1;
    print "$rank: $school\n";
    $rank--;
}

I run it.  It doesn't work (I think).  It isn't changing the name.  Hmmm.  I keep killing it off before it finishes and examine the code.  I cannot see a problem so let it run - this time all the way to the end.  It does NOT change the name until it counts down to 50.  Wild!  I check the site - sure enough - there are no new schools after 50.  This is a list of the (alleged) Top 50 Colleges going under the name Top 100 Colleges.

Ha!  I guess they never thought anyone would ever have the patience to wade through all of those ads to see all 100 (I sure didn't).  This makes me wonder: Is it just a scam to get you to click through ads or did the content creator tell his boss he put all 100 pages up there knowing NO ONE would ever click through all of that.  If it's the latter - sorry buddy (or lady) for exposing you.