Technology Archives - Waking up in Geelong
https://wongm.com/category/technology/
Marcus Wong. Gunzel. Engineering geek. History nerd.

Google Image Search applying OCR to indexed images
https://wongm.com/2018/08/google-image-search-applying-ocr-indexed-images/
Mon, 06 Aug 2018

While doing some online research I found evidence of some new functionality in Google Image Search – when crawling the web, Google is applying OCR (Optical Character Recognition) to the images that it finds, and using this data in its search index.

I was writing a post about the use of antimacassars onboard V/Line trains, so started researching the Australian supplier of the seat headrest covers.

BTN263 looking to the east end

My search term in Google was ‘merino headrex’, which brought up only one relevant search result: a trademark application for the ‘Headrex’ name by Encore Tissue (Aust) Pty Ltd, owner of the ‘Merino’ brand.

Bing Search also delivered similar results for the same search query.

But when I flicked over to Google Image Search, something new appeared.

A photo of mine – the very one that led me to search for ‘merino headrex’ in the first place.

But the spooky part – I had never put the words ‘merino’ or ‘headrex’ anywhere on my website.

So the most likely explanation – Google is applying OCR to the images that it finds, then adding the data to its search index.

More on Google and OCR

Over the years a number of Search Engine Optimization (SEO) blogs have speculated about Google’s search indexing capabilities.

From TechCrunch in January 2008:

A patent application lodged by Google in July 2007 but recently made public seeks to patent a method where by robots (computers) can read and understand text in images and video.

The extension of the application would be that images and video indexed by Google would be searchable by the text located within the image or video itself, a big step forward in indexing that has not previously been available.

Information Week suggests that privacy issues raised by Google Maps Street View will get more complicated as eventually YouTube videos will be indexable via the text that appears within them.

From ‘SEO by the Sea’ in November 2015:

I had some hope over the years that Google might get better at indexing text that appeared within links, watching some things like the following happen:

(1) Google acquired Facial and object recognition company Nevenvision in 2006, and a few other companies that can recognize images.

(2) In 2007, Google was granted a patent that used OCR (Optical Character Recognition) to check upon the postal addresses on business listings, to verify those businesses in Google Maps.

(3) Google was granted a similar patent in 2012 that read signs in buildings in Street Views images.

(4) In 2011, Google published a patent application that used a range of recognition features (object, facial, barcodes, landmarks, text, products, named entities) focusing upon searching for and understanding visual queries, which looks like it may have turned into the application for Google Goggles, which came out in September of 2010 – the visual queries patent was filed by Google in August, 2010, the nearness in time with the filing of the patent and the introduction of Google Goggles reinforces the idea that they are related.

But, Googlebot still doesn’t seem to be able to read text in images for purposes of indexing addresses, or to read images of text used in navigation. I added the text “Google Test” to the following image, and then ran it through a reverse image search at Google. The images returned were similar looking, but none of them had anything to do with the text I added to the image.

And ‘Search Engine Roundtable’ in March 2016:

A question was posed to Google’s Gary Illyes on Twitter if Google’s crawler and indexer understands the text embedded in an image, maybe through OCR or other techniques. I am surprised to hear Gary say no.

A year is a long time on the internet.

Footnote on Google Image Search for obscure topics

Take a look at the other results from Google Image Search, and spot the odd one out.

My photo of the seat covers, and the Merino sheep make sense. But these three photos…

Railway tracks on a wharf.

Lead from Melbourne Yard arrives at wharves 1-4

People in hi-vis vests standing around a pile of wood.

Pile of wooden packing pieces used to get the derailed 8114 back onto the rails

And a train covered in a tarpaulin.

Sprinter 7012 still covered with a tarpaulin at the Dudley Street sidings

They have nothing at all to do with a merino sheep, but they do have one thing in common – they are hosted on the same domain as my ‘merino headrex’ image.

Thanks to the lack of any other relevant results, Google’s algorithms decided that proximity to a relevant image was enough of a ranking signal to push these photos up the search result pages.

I’ve confused Google’s algorithms in this way before, with my Hong Kong themed blog at www.checkerboardhill.com/.

I think I've misled Google?

I searched Google for “Sheung Shui slaughterhouse” but was given my own photo of an Australian diesel locomotive!

How far is Myki making you walk?
https://wongm.com/2017/03/how-far-is-myki-making-you-walk/
Mon, 20 Mar 2017

If you want to catch a tram in Melbourne then you need a Myki, despite the fact you can’t buy one or top it up onboard the tram. In 2017 The Age highlighted the difficulty this can pose for intending tram passengers, in an article on myki “dead zones” – tram stops where the nearest place to top up your myki is at least a kilometre away. Coincidentally I started work on an almost identical project years ago but never finished it, so what better time to polish it off?

Some background

An integral part of the original Myki system was ticket machines onboard trams – they would have allowed passengers to top up their myki, or to purchase a ‘Short Term Ticket’ if they didn’t hold a myki.

A single ticket machine was installed onboard a Melbourne tram in early 2009 as part of the myki field trial program, with it remaining in place but not in use until at least November 2011.

Myki ticket machine in B2.2012, with the screens all covered up

Short Term Tickets were a cardboard smartcard which entitled the holder to two hours of travel, at a cost slightly higher than the normal fare charged to standard myki users.

Sales of these tickets commenced in 2009 when Myki went live in Geelong, and they continued to be sold onboard buses in Geelong, Ballarat, Bendigo, Seymour and the Latrobe Valley until 2013.

Short term cardboard myki ticket from a Geelong bus

In Melbourne the sale of Short Term Tickets was never enabled, with the option to do so being disabled for all myki machines located in the city.

Blurb on a Myki machine about the since-cancelled short term tickets

The rollout of short term fares and myki machines onboard trams was cancelled by the Baillieu Government in June 2011, acting on advice contained in a secret report by consulting firm Deloitte.

One reason given for the withdrawal of Short Term Tickets was the cost of the cards – at $0.40 each to manufacture, they made up almost half of the $0.90 charged for a concession bus fare in Geelong!

In 2011 Yarra Trams said that the change would reduce the tram company’s costs, boost space for passengers and reduce fare evasion issues by eliminating a key reason given for not buying a ticket.

Everyone else says the cancellation of Short Term Tickets and onboard top ups makes it much more difficult for passengers who only use trams to pay their fare.

Enter the Public Transport Victoria API

Back in March 2014 Public Transport Victoria finally opened up the application programming interface (API) which powers their mobile apps, so I decided to have a play around with it.

With the mobile landscape already littered with hundreds of different trip planning apps, I decided to build something slightly different – something to point out how the lack of ticket purchase options onboard trams wastes the time of intending passengers.

The API allows programmers to access all kinds of data – tram routes and Myki retailers being two of them, so I built an app that caters for two use cases:

  • you’re at home, work, or a friend’s house – and you’ve discovered that you don’t have a Myki on hand. Where is the nearest place to buy a new one, and how much further will this detour take you compared to purchasing a ticket onboard the tram?
  • you’ve just stepped onto a tram and discovered that you don’t have any credit left on your Myki. How far will you have to walk to top up, and then where can you get back on your way?

The end result is ‘walki’ – a small app that works on any device with a web browser.

The logic in the app is as follows:

  1. Show the user their current location,
  2. Calculate distance to nearest tram stop,
  3. Calculate distance to nearest Myki retailer,
  4. Calculate distance from Myki retailer back to nearest tram stop,
  5. Plot the walking routes on a map,
  6. Compare the distances for each,
  7. And finally, show the user how much further they have to walk thanks to the lack of ticket sales onboard trams.

Simple?
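
Steps 2 to 4 are just distance calculations – before asking for an actual walking route, you can shortlist the nearest stops and retailers with the haversine formula. A minimal sketch of the kind of helper involved (the function name and example coordinates are mine, not necessarily what walki uses):

<?php
// Straight-line distance between two points in metres, using the haversine formula.
// A real walking route will always be at least this long.
function haversineDistance($lat1, $lon1, $lat2, $lon2)
{
    $earthRadius = 6371000; // mean radius of the Earth, in metres

    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);

    $a = pow(sin($dLat / 2), 2)
       + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * pow(sin($dLon / 2), 2);

    return $earthRadius * 2 * atan2(sqrt($a), sqrt(1 - $a));
}

// Example: Flinders Street Station to Southern Cross Station - roughly 1.3 km.
echo round(haversineDistance(-37.8183, 144.9671, -37.8184, 144.9525)) . " metres\n";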

You can see it for yourself at https://wongm.com/walki/, or using these examples.

Technology

The app itself isn’t anything revolutionary from a technology standpoint.

In the backend I’m using boring old PHP to gather tram stop and myki retailer locations through calls to the PTV API, with the resulting data being mashed around in Javascript until they are drawn out on a pretty map.
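
Every PTV API call has to be signed – your developer ID goes into the query string, and an HMAC-SHA1 of the path and query string, keyed with your developer key, is appended as the signature. A minimal sketch of the signing helper (the ‘stops nearby’ endpoint path is written from memory, so treat it as an assumption):

<?php
// Build a signed PTV API URL: the signature is an HMAC-SHA1 of the endpoint
// path and query string (devid included), keyed with the developer key.
function ptvApiUrl($endpoint, $devId, $devKey)
{
    $baseUrl = 'http://timetableapi.ptv.vic.gov.au';

    $separator = (strpos($endpoint, '?') === false) ? '?' : '&';
    $request = $endpoint . $separator . 'devid=' . $devId;

    $signature = strtoupper(hash_hmac('sha1', $request, $devKey));

    return $baseUrl . $request . '&signature=' . $signature;
}

// Hypothetical credentials - PTV issues a developer ID and key on registration.
$url = ptvApiUrl('/v2/nearme/latitude/-37.8183/longitude/144.9671', '1000123', 'your-secret-key');
$stops = json_decode(file_get_contents($url), true);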

The frontend code is static HTML files with a smattering of jQuery Mobile (‘state of the art’ for 2014? :-P) over the top, with the maps being drawn using the Google Maps JavaScript API v3.

The files are all hosted on my vanilla Apache web server, and you can find the source code on GitHub.

Footnote

Here is the original article from The Age – Melbourne’s myki retailers: Where is the nearest place to top up my myki?, by Craig Butt and Andy Ball.

They are the myki dead zones – the tram stops where the nearest place to top up your myki is at least a kilometre away.

If you do not have any credit on your myki you are expected to take reasonable steps to top up, but from these locations you rack up at least 1200 paces to get to the nearest store or machine.

Using the interactive below, you can detect myki dead zones on your tram line and find out where the nearest myki retailer is, in case you ever find yourself short on credit.

One consideration I completely forgot about was the opening hours of Myki retailers:

But keep in mind the opening times of the retailer. When we trekked 1.5 kilometres through Melbourne’s biggest myki desert on a scorching 31 degree day to top up at Bundoora Post Office, we were fortunate enough to get there half an hour before closing time.

But we would have been out of luck after 5pm that afternoon, after midday on Saturday, or when it is closed all day on Sunday.

About 95 per cent of the state’s 800-plus myki retail outlets are open on Saturday, 75 per cent on Sunday and 32.5 per cent are open all hours.

But one thing we did agree on is the lack of Myki retailers along route 86.

And if you live near Plenty Road in Bundoora and rely on the route 86 tram, you’re in the worst spot in Melbourne.

The stop at the corner of Plenty Road and Greenwood Drive is the worst myki dead zone in Victoria for trams. If you forget your card or are out of credit, it’s a 1.5 kilometre walk to the nearest post office to top up.

I included a few examples in my app, one from Reservoir resulting in an extra 1.86 kilometre (23 minute) long walk!

Married men and Facebook ads for “Singles Events Melbourne”
https://wongm.com/2016/06/married-men-facebook-ads-singles-events-melbourne/
Thu, 09 Jun 2016

The other day a suggested post for “Singles Events Melbourne” showed up in my Facebook timeline. Since I’m married, it isn’t exactly something I’m interested in.

Facebook suggested post - "Singles Events Melbourne"

I initially thought it a case of mistargeted Facebook advertising, so I followed the “Why am I seeing this?” link to find out why Facebook had decided to show me the ad.

Facebook suggested post - "Why am I seeing this?"

And there lay the answer.

Facebook suggested post - "Why am I seeing this?"

I tick all four boxes:

  • Relationship status: married
  • Gender: male
  • Age: 18+
  • Location: Melbourne

Yet I still don’t care.

‘Altostrat’ – Google’s fictitious company
https://wongm.com/2015/07/altostrat-google-fictitious-company/
Thu, 30 Jul 2015

Microsoft is well known for their long list of fictional companies – the names of which crop up often in the documentation for their software. However they aren’t the only company to do so – Google has a fake company of their own – ‘Altostrat’.

WHOIS for AltoStrat.com domain via DomainTools

The altostrat.com domain itself was created in November 2007.

A quick Google Search brings up this intranet page for ‘Project Eggplant‘ at Altostrat.

'Project Eggplant' homepage - a Google Sites example site

It also brings up the homepage of ‘Altostrat Pediatrics‘.

Homepage for 'Altostrat Pediatrics' - another Google Sites example site

The Altostrat name makes an appearance in this Google tutorial on ‘Configure Google Apps for Calendar Interop‘.

'Altostrat' mentioned in a Google tutorial on Google Apps

And this tutorial on ‘Publish a private Chrome app‘.

'Altostrat' mentioned in a Google tutorial for building Chrome apps

So where did the name come from?

Getting to the bottom of the mystery

When I filter the Google search results to only show pages published before February 2008, the number of ‘Altostrat’ mentions falls off a cliff.

'Altostrat' in the Google Search results

Why is that date significant? Turns out Google launched their new ‘Google Sites’ product on February 28, 2008.

Google launched Google Sites, basically a relaunch of Jotspot but with many more features. In short, this new software allows teams to share much like you could with Microsoft’s SharePoint.

According to TechCrunch, it took 16 months to relaunch Jotspot as Google Sites, so the registration of the domain in November 2007 fits with the same development timelines.

So my final explanation – Altostrat was an internal name used by the Google Sites team during the redevelopment of their product, and has been reused by others at Google in the years that have followed.

Fixing my blog robot
https://wongm.com/2015/05/wordpress-scheduled-post-issues/
Sun, 24 May 2015

One thing you might not know about this site is that I don’t actually wake up each morning and type up a new blog post – I actually write them ahead of time, and set them up to be published at a future time. Unfortunately this doesn’t always work, such as what happened to me a few weeks ago.

XPT derailed outside Southern Cross - July 11, 2014

I use WordPress to host my various blog sites, and it has a feature called “scheduled posts” – set the time you want the post to go online, and in theory they’ll magically appear in the future, without any manual intervention.

For this magic to happen, WordPress has to regularly check what time it is, check if any posts are due to be published, and if so, publish them – a process that is triggered in two different ways:

  • run the check every time someone visits the site, or
  • run the check based on a cron job (scheduled task)

The first option is unreliable because it delays page load times, and you can’t count on people visiting a low traffic web site, so the second option is what I put in place when setting up my server.
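
If you go down the cron job route, WordPress also lets you switch off the visitor-triggered check, so the two mechanisms don’t trip over each other – a single line in wp-config.php:

// In wp-config.php: stop WordPress running its scheduled task check on every
// page view, leaving the external cron job as the only trigger
define('DISABLE_WP_CRON', true);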

I first encountered troubles with my scheduled posts in early April.

My initial theory was that a recently installed WordPress plugin was to blame, running at the same time as the scheduled post logic and slowing it down.

I removed the plugin, and scheduled posts on this site started to work again – I thought it was all fixed.

However, a few weeks later I discovered that new entries for my Hong Kong blog were missing in action.

I took a look at the config for my cron job, and it seemed to be correct.

*/2 * * * * curl http://example.com/wp-cron.php > /dev/null 2>&1

I hit the URL featured in the command, and it triggered the publication of a new blog post – so everything looked good on that front!

I then dug a bit deeper, and ran the curl command directly on my server.

user@server:~$ curl http://example.com/wp-cron.php
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved 
<a href="http://www.example.com/wp-cron.php">here</a>.
</p>
<hr>
<address>Apache Server at example.com Port 80</address>
</body></html>

Bingo – I had found my problem!

Turns out I had previously added a non-www to www redirect for the website in question via a new .htaccess rule – and by default curl doesn’t follow HTTP redirects.

The end result was my cron job hitting a URL, finding a redirect but not following it, resulting in the PHP code never being executed, and my future dated blog posts laying in limbo.

My fix was simple – update my cron job to hit the www. version of the URL – and since then, my future dated blog posts have all appeared on the days they were supposed to.
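
The other fix would have been to tell curl to follow redirects with its -L flag – either version of the cron entry does the job:

*/2 * * * * curl http://www.example.com/wp-cron.php > /dev/null 2>&1
*/2 * * * * curl -L http://example.com/wp-cron.php > /dev/null 2>&1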

About the lead photo

The train in the lead photo is the Melbourne-Sydney XPT – on 11 July 2014 it derailed near North Melbourne Station due to a brand new but poorly designed turnout.

Tracing a performance issue on my web server
https://wongm.com/2015/01/tracing-performance-issue-apache-mysql-php-web-server/
Mon, 05 Jan 2015

Managing my various web sites can be difficult at times, and my experience the other weekend was no different. My day started normally enough, as I logged onto my VPS and installed the latest security patches, then set to work on uploading new photos to my site. It was then I noticed my web site was taking minutes to load pages, not seconds, so I started to dig into the cause.

Server statistics logged by New Relic

My initial setup

After I moved from shared web hosting, my collection of websites had been running on a $5 / month VPS from Digital Ocean – for that I got 1 CPU, 512 MB of RAM, and 20 GB of disk space. On top of that I used an out-of-the-box Ubuntu image, and installed Apache for the web server and MySQL for the database server.

I then installed a number of separate WordPress instances for my blogs, a few copies of Zenphoto to drive my different photo galleries, and then a mishmash of custom code for a number of other side projects. All of that is exposed via four different domain names, all of which sit behind the CloudFlare CDN to reduce the load on my server.

With so many web sites running on just 512 MB of RAM, performance was an issue! My first fix was to set up a 1 GB swap file to give some breathing room, which did stabilise the situation, but MySQL would still crash every few days when the server ran out of memory.
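
For the record, setting up a swap file on Ubuntu is a standard recipe along these lines – reserve a file on disk, format it as swap, enable it, then make it permanent:

sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab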

Swapping out Apache for the much less memory intensive Nginx web server is one way to fix the issue, but I didn’t have time for that. My solution – cron jobs to check the status of my server and restart the services as required!

The first script I came up with checked if the MySQL service was running, and start it up if it wasn’t.

#!/bin/bash
# Restart MySQL if its upstart status doesn't report it as running
service mysql status | grep 'mysql start/running' > /dev/null 2>&1
if [ $? != 0 ]
then
    SUBJECT="MySQL service restarted $(date)"
    service mysql status | mail -s "$SUBJECT" me@example.com
    sudo service mysql start
fi

My second attempt negated the need for the first script, as it checked how much swap space was still free on my server – a proxy for overall memory pressure – and restarted Apache if the figure fell below a given threshold.

#!/bin/bash
# Restart Apache when free swap drops below this threshold (MB)
THRESHOLD=300

# Fourth column of the 'Swap:' line from free -m is the unused swap in MB
available=$(free -m | awk '/^Swap:/{print $4}')
if [ $available -lt $THRESHOLD ]
then
    SUBJECT="Apache service restarted $(date)"
    service apache2 status | mail -s "$SUBJECT" me@example.com
    sudo service apache2 restart
fi

Under normal load my cron job would restart Apache every day or so, but it did keep the database server up for the rest of the time.

Something is not right

After realising my web site was taking minutes to load pages, not seconds, I started to dig into my server logs. CPU load was hitting 100%, as was memory consumption, and my cron job was restarting Apache every few minutes – something wasn’t quite right!

My first avenue of investigation was Google Analytics – I wanted to find out if the spike in load was due to a flood of new traffic. The Slashdot effect is a nice problem to have, but in my case it wasn’t to be – incoming traffic was normal.

I then took a look at my Apache access logs – they are split up by virtual host, so I had a number of log files to check out. The first suspicious entries I found were brute force attacks on my WordPress login pages – blocking those was simple, but the server load was still high.

Spending my way out

When looking to upgrade a system to handle more traffic, there are two completely different ways to go about it:

  • Be smart and optimise what you already have, to do more with the same resources
  • Throw more resources at the problem, and just ignore the cause

My server was already nearing the 20 GB disk space limitation set by Digital Ocean on their $5 / month VPS, so I figured an upgrade to the next size VPS might fix my problem. Upgrading a Digital Ocean ‘droplet’ is a simple job with their ‘Fast-Resize’ functionality – it takes about a minute – but in my case the option wasn’t available, so I had to do it the hard way:

  1. shut down my server,
  2. create a snapshot of the stopped virtual machine,
  3. spin up a new Digital Ocean server,
  4. restore my snapshot to the new server,
  5. point CloudFlare from my old server IP address to the new one.

All up it took around 30 minutes to migrate from my old server to my new one, but at least with CloudFlare being my public facing DNS host, I didn’t have to wait hours for my new IP address to propagate across the internet!

Unfortunately, the extra resources didn’t fix my problem – CPU load was still through the roof.

Digging for the root cause

I first installed the htop process viewer on my server, and was able to see that MySQL was using far more CPU than normal – presumably my caching wasn’t working right, and my web pages were having to be generated with fresh database queries each time.

Next I fired up a MySQL console, and had a look at the currently running queries. Here I noticed the same curious looking query appearing over and over again:

SELECT @serachfield ...

A check of the code deployed to my server indicated that the query was thanks to the search function in Zenphoto, and when I went back into my Apache access logs, I eventually found the problem – a flood of hits on my photo gallery.

Apache web server access logs

Each line in the logs looked like the following:

108.162.250.234 – – [21/Dec/2014:04:32:03 -0500] “GET /page/search/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/beacon-3.newrelic.com HTTP/1.1” 404 2825 “https://railgallery.wongm.com/page/search/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/maintenance/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/nr-476.min.js” “Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)”

Each request was bound for “http://js-agent.newrelic.com/nr-476.min.js” or other files hosted at newrelic.com, and the user agent always appeared to be Internet Explorer 8.

New Relic is a software analytics tool I have installed on my server, and on seeing the multiple references to it in my access logs, I remembered that I had updated my version of the New Relic agent just before my performance issues had started. Had I found a bug in it?

The cause

A check of the HTML source of the page in question showed a link to js-agent.newrelic.com embedded in the page, so I came up with the following explanation for the load on my server:

  1. A user hits https://railgallery.wongm.com/page/search/SEARCH_TERM
  2. The New Relic Javascript file at http://js-agent.newrelic.com/nr-476.min.js somehow gets loaded as a relative path, and not an absolute one (see the snippet after this list), which results in a request to:
    https://railgallery.wongm.com/page/search/SEARCH_TERM/js-agent.newrelic.com/nr-476.min.js
  3. My server would then treat the above URL as valid, delivering a page, which then includes a relative link to js-agent.newrelic.com/nr-476.min.js a second time, which then results in a page request to this URL:
    https://railgallery.wongm.com/page/search/SEARCH_TERM/js-agent.newrelic.com/js-agent.newrelic.com/nr-476.min.js
  4. And so on recursively:
    https://railgallery.wongm.com/page/search/SEARCH_TERM/js-agent.newrelic.com/js-agent.newrelic.com/js-agent.newrelic.com/nr-476.min.js
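
My best guess at the shape of the bug – a script tag that somehow lost its scheme, so the browser resolved it against the current page URL instead of as an external address (illustrative markup only, not what New Relic actually injected):

<!-- Schemeless src - resolved relative to the current page URL -->
<script src="js-agent.newrelic.com/nr-476.min.js"></script>

<!-- What it should have been - a protocol-relative URL -->
<script src="//js-agent.newrelic.com/nr-476.min.js"></script>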

With the loop of recursive page calls for a new set of search results, each requiring a fresh database query, it was no wonder my database server was being hit so hard.

As an interim fix, I modified the Zenphoto code to ignore search terms that referenced New Relic (sketched below), and then rolled back to the older version of the New Relic agent.

sudo apt-get remove newrelic-php5
sudo apt-get remove newrelic-php5-common
sudo apt-get remove newrelic-daemon
sudo apt-get autoremove newrelic-php5
sudo apt-get install newrelic-php5-common=4.15.0.74
sudo apt-get install newrelic-daemon=4.15.0.74
sudo apt-get install newrelic-php5=4.15.0.74
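
The guard itself was nothing sophisticated – something of this shape, run early in the request, before Zenphoto fires off any search queries (a sketch of the approach, not the exact patch):

// Bail out of any request whose URL contains the runaway New Relic path,
// before Zenphoto runs a fresh set of search queries against the database
if (strpos($_SERVER['REQUEST_URI'], 'js-agent.newrelic.com') !== false) {
    header('HTTP/1.1 404 Not Found');
    exit;
}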

I then raised a support case for New Relic to look into my issue. In an attempt to reproduce the issue, I rolled forward with the current version of the New Relic agent to play ‘spot the difference’, but I couldn’t find any, and the errors also stayed away.

I’m writing this one off as a weird conflict between the updated New Relic agent running on my server, and an old version of the browser monitor javascript file cached by a single remote user.

Conclusion

After working through my performance issues I now know more about what my web server is doing, and the extra RAM available following the upgrade means my horrible cron job hacks are no longer required to keep the lights on!

As for the steps I will follow next time, here are the places to check:

  • Google Analytics to check if I am getting a flood of legitimate traffic,
  • Apache access logs for any odd looking sources of traffic,
  • current process list to see where the CPU usage is coming from,
  • currently running MySQL queries for any reoccurring patterns.

Bank transfers and account name checking
https://wongm.com/2014/10/bank-transfers-account-name-checking/
Mon, 27 Oct 2014

I’ve been making electronic transfers from my bank account for years and never had any trouble with them – just plug in the account name, number and BSB code into the form, and a few days later, the money arrives in the destination account.

Window washers descend the NAB Building

I recently switched banks and unfortunately for me, a bank transfer failed – the money left my account but didn’t arrive at the destination account – but thankfully it bounced back a few days later! I verified the account number and BSB code I used with the destination bank, who said that it was all correct, but they flagged one possible issue – the account name.

Normally the account name isn’t something I worry about – I don’t always use my middle name, so my account names are all different, yet the bank usually manages to get my money to where it should go. However, it looks like my new bank is a bit more pedantic than my old one, as this thread on the Whirlpool forums suggests:

In a previous thread it was mentioned that ‘big banks’ do not cross reference account names and account numbers, and thus only a BSB and account number will suffice for transfers to go through. It was said credit unions/smaller institutions do manually cross reference account names / account numbers and therefore account names are required.

On further investigation I realised that I had entered a dummy name for the bank transfer which failed – problem solved!

The Financial Ombudsman Service has more to say on account names matching when processing electronic transfers, in this document dated September 2003 (my emphasis):

Internet banking screens for online payments commonly require the name, and account number (including the BSB) of the intended recipient’s account to be keyed in. Traditionally, the account name has been treated as part of the payment instructions on, for example, a deposit slip and the account name has always been an important part of the instructions for payment of a cheque. Payers often assume that the name and account number for a deposit will be checked against each other before the funds are credited to the payee’s account. In practice we know that an electronic transfer is processed solely on the basis of the account number.

This has the effect that, if the payer keys in the wrong account number the payment will be made but to the holder of the account number that has been keyed in. The mistake may only come to light when the intended recipient tells the payer that the payment has not been received. When the payer tries to find out where the payment has actually gone, he or she may be told that the recipient’s name cannot be released for reasons of confidentiality. Their bank may claim that it acted on the basis of the instructions it was given, that is, the account number.

The Ombudsman Service goes on to detail how the Bulk Electronic Clearing System (BECS) rules apply to online transfers, and account name matches – it gets complicated very quickly in regards to which bank is responsible for bank transfers misrouted due to account number / name mismatches.

So the moral of the story seems to be don’t fat finger the account numbers of transfers to big banks, as they might send the money to the wrong person – and pay attention to the account names for transfers to small banks, as they actually pay attention to the small details!

Fake Facebook screenshots aren’t hard at all
https://wongm.com/2014/10/fake-facebook-screenshots-not-hard-to-make/
Mon, 06 Oct 2014

A few weeks ago an article about political staffers and faked Facebook posts appeared in The Age.

Multicultural Affairs Minister Matthew Guy may ask police to investigate a series of Facebook posts purporting to be one of his staffers making highly racist and sexist comments about senior Liberals.

In a bizarre twist to the state government’s recent social media woes, several Facebook screen grabs claiming to come from one of Mr Guy’s employees have been distributed, with offensive references to Asians as “slopes”, Arabs as “towel-heads” and Arts Minister Heidi Victoria as a “dumb blonde.”

A spokesman for Mr Guy, who is also the state’s planning minister, insisted the statements were fabricated, and the Minister’s office is now considering whether to refer them to police on the grounds of fraud and defamation.

But the fact they were distributed in the first place – and the considerable effort it would have taken to get them looking like genuine Facebook material – paints a worrying sign of the battles now being waged in politics using social media.

The “considerable effort” line is the part that caught my eye, as creating a fake Facebook page is incredibly easy once you know what you are doing – such as this example I created years ago.

Fake Facebook page for diesel locomotive T378

In the ‘old’ days of the internet creating fake web pages required one to take a screenshot of a source web page, find a font that matches the original, and then add your own text in using Photoshop.

Today you don’t need to go to anywhere near as much effort – just open up the ‘Developer Tools’ panel of your web browser (my example is Google Chrome), find the text you want to change, and then type in your new slanderous text.

Fake Facebook page for the Victorian Greens

I can’t imagine the Greens ever supporting an expansion of brown coal mining in Victoria, but just look at what their Facebook page says!

Rebuilding all of my websites
https://wongm.com/2014/07/rebuilding-websites/
Wed, 09 Jul 2014

I’ve been quite busy recently – on Thursday last week I discovered all of my web sites were offline, which resulted in me moving to a new hosting provider, and rebuilding every bit of content. So how did I do it?

Going offline

I first realised something was wrong when I discovered all of my web sites displaying the following ominous error message:

 'Website Suspended' message from cPanel

I checked my email, and I couldn’t find any notification from my hosting provider that my account was suspended – a pretty shit job from them!

However, I wasn’t exactly surprised, as over the past few years I’ve been receiving these automated emails from their system:

Your hosting account with username: [XYZ] has over the last few days averaged CPU usage that is in excess of your account allocation.

This could be caused by a number of factors, but is most likely to be due to a misconfigured installation of a 3rd party script, or by having too many features, modules or plugins enabled on your web site.

If you simply have a very busy or popular web site, you may need to upgrade your account which will give you a higher CPU allocation. Please contact our support team if you need help with this.

Until your usage average drops back below your CPU quota, your account will be throttled by our CPU monitoring software. If your account continues to use more CPU than what it is entitled to, you risk having your account suspended.

All up I was running about a dozen different web sites from my single shared web hosting account, and over the years I’ve had to increase the amount of resources available to my account to deal with the increasing load.

Eventually I ended up on a ‘5 site’ package from my hosting provider, which they were charging me almost $300 a year for – a steep price, but I was too lazy to move everything to a new web host, so I just kept on paying it.

Having all of my sites go offline was enough of a push for me to move somewhere new!

What needed to be moved

All up my online presence consisted of a dozen different sites spread across a handful of domain names, running a mix of open source code and code I had written myself. With my original web host inaccessible, I had to rebuild everything from backups.

You do have backups, don’t you?

Welcome to the western suburbs

The rebuild

I had been intending to move my web sites to a virtual private server (VPS) for a while, and having to rebuild everything from scratch was the perfect excuse to do so.

I ended up deciding to go with Digital Ocean – they offer low-ish prices, servers in a number of different locations around the world, fast provisioning of new accounts, and an easy migration path to a faster server if you ever need it.

After signing up to their bottom end VPS (512 MB RAM and a single core) I was able to get cracking on the rebuild – they emailed me the root password a minute later and I was in!

As I had a bare server with nothing installed, a lot of housekeeping needed to be done before I could start restoring my sites:

  • Swapping over the DNS records for my domains to my new host,
  • Locking down access to the server,
  • Setting up a swap file,
  • Installing Apache, MySQL and PHP on the server,
  • Creating virtual directories on the server for each separate web site,
  • Creating user accounts and empty databases in MySQL (see the sketch below)
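
That last step is a few lines of SQL per site – the database and user names here are placeholders, with one set needed for each WordPress or Zenphoto install:

-- One empty database plus a dedicated user per site, ready for the restore
CREATE DATABASE wongm_blog CHARACTER SET utf8;
CREATE USER 'wongm_blog'@'localhost' IDENTIFIED BY 'a-long-random-password';
GRANT ALL PRIVILEGES ON wongm_blog.* TO 'wongm_blog'@'localhost';
FLUSH PRIVILEGES;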

I’ve only ever played around with Linux a little, but after 30 minutes I had an empty page appearing for each of my domain names.

To get my content back online, thankfully I had the following backups available to me:

  • I run three blogs on the open source WordPress software, so I could just install that from scratch to get a bare website back
  • My main photo gallery runs on the open source ZenPhoto software, so that was another internet download
  • Each blog and photo gallery uses a custom theme, of which I had backups on my local machine to re-upload
  • I keep a mirror of my WordPress uploads on my local machine, so I just had to reupload those to make the images work again
  • When I upload new photos to my gallery, I keep a copy of the web resolution version on my local machine, which I was able to reupload
  • Every night I have a cron job automatically emailing me a backup copy of my WordPress and ZenPhoto databases to me, so my blog posts and photo captions were safe
  • Some of my custom web code is available on GitHub, so a simple git pull got those sites back online

Unfortunately I ran into a few issues when restoring my backups (doesn’t everyone…):

  • My WordPress backup was from the day before, and somebody had posted a new comment that day, so it was lost
  • I had last mirrored my WordPress uploads about a week before the crash, so I was missing a handful of images
  • The last few months of database backups for Rail Geelong were only 1kb in size – it appears the MySQL backup job on my old web host was defective
  • Of the 32,000 photos I once had online, around 2,000 files were missing from the mirror I maintained on my local machine, and the rest of them were in a folder hierarchy that didn’t match that of the database

I wasn’t able to recover the lost comment, but I was able to chase up the missing WordPress uploads from other sources, and thankfully in the case of Rail Geelong my lack of regular updates meant that I only lost a few typographical corrections.

As for the 2,000 missing web resolution images, I still had the original high resolution images available on my computer, so my solution was incredibly convoluted:

  • Move all of the images from the mirror into a single folder
  • Use SQL to generate a batch file to create the required folder structure
  • Use more SQL to generate a second batch file, this time to move each image into the correct place in the folder structure (both queries are sketched below)
  • Run a diff between the images that exist, and those that do not
  • Track down the 2,000 missing images in my collection of high resolution images, and create a web resolution version in the required location
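
The SQL-to-batch-file trick is less strange than it sounds – you select shell commands as strings, and write them out to a file. Something along these lines, assuming Zenphoto’s albums and images tables (table and column names are from memory, so treat them as approximate):

-- One 'mkdir' per album, written out as a shell script
SELECT CONCAT('mkdir -p "', folder, '"')
INTO OUTFILE '/tmp/create-folders.sh'
FROM zp_albums;

-- Move each image from the flat mirror into its proper album folder
SELECT CONCAT('mv "', i.filename, '" "', a.folder, '/', i.filename, '"')
INTO OUTFILE '/tmp/move-images.sh'
FROM zp_images i
JOIN zp_albums a ON a.id = i.albumid;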

Three hours after I started, I had my first win.

Unfortunately I found a number of niggling issues throughout the night.

By 2am I was seven hours in, and had managed to get another domain back online.

Eventually I called it quits at 4am, as I waited for my lethargic ADSL connection to push an elephant up a drinking straw.

I spent the weekend out and about so didn’t get much time to work on rebuilding my content – it wasn’t until the fourth day after my sites went down that I started to track down the 2,000 missing images from my photo gallery.

Thankfully I got a lucky break – on Monday afternoon I somehow regained access to my old web host, so I was able to download all of my missing images, as well as export an up-to-date version of the Rail Geelong database.

After a lot more stuffing around with file permissions and monitoring of memory usage, by Tuesday night it seemed that I had finally rebuilt everything, and it was all running somewhat reliably!

What’s next

Plenty of people online seem to rave about replacing the Apache web server and standard PHP stack with Nginx and PHP-FPM to increase performance – it’s something I’ll have to try out when I get the time. However for the moment, at least I am back online!

News Limited and the ‘sslcam’ redirect
https://wongm.com/2014/07/news-limited-websites-sslcam-redirect/
Mon, 07 Jul 2014

Recently I was in the middle of researching a blog post, when my internet connection crapped out, leaving me at an odd looking URL. The middle bit of it made sense – www.theaustralian.com.au – but what is up with the sslcam.news.com.au domain name?

https://sslcam.news.com.au/cam/authorise?channel=pc&url=http%3a%2f%2fwww.theaustralian.com.au%2fbusiness%2flatest%2fsmartphone-app-to-track-public-transport-woes%2fstory-e6frg90f-1226863043667

I then started researching the odd looking domain name, with the only thing of note being somebody else complaining about it.

I then went back to the original link I clicked on, and followed the chain of network activity that followed.

News Limited using 'sslcam' domain to track users

First hit – the shortened link I found on Twitter:

http://t.co/wLP4Lj9kXP

Which redirected to the article on the website of The Australian:

http://www.theaustralian.com.au/business/latest/smartphone-app-to-track-public-transport-woes/story-e6frg90f-1226863043667

Which then redirected me to a page to check for cookies – presumably part of their paywall system:

http://www.theaustralian.com.au/remote/check_cookie.html?url=http%3a%2f%2fwww.theaustralian.com.au%2fbusiness%2flatest%2fsmartphone-app-to-track-public-transport-woes%2fstory-e6frg90f-1226863043667

It then sent me back to the original article:

http://www.theaustralian.com.au/business/latest/smartphone-app-to-track-public-transport-woes/story-e6frg90f-1226863043667

Which then bounced me to the mysterious sslcam.news.com.au domain:

https://sslcam.news.com.au/cam/authorise?channel=pc&url=http%3a%2f%2fwww.theaustralian.com.au%2fbusiness%2flatest%2fsmartphone-app-to-track-public-transport-woes%2fstory-e6frg90f-1226863043667

And third request lucky – the original article:

http://www.theaustralian.com.au/business/latest/smartphone-app-to-track-public-transport-woes/story-e6frg90f-1226863043667

Quite the chain of page redirects!
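
If you ever want to replay a chain like this yourself, curl will follow each hop and show its headers as it goes – filtering for the status and Location lines gives a neat summary of the whole chain:

curl -sIL http://t.co/wLP4Lj9kXP | grep -i -E '^(HTTP|Location)'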

The sslcam.news.com.au domain

Internet services company Netcraft have collated the following information:

Date first seen: January 2012
Organisation: News Limited
Netblock Owner: Akamai International, BV
Nameserver: dns0.news.com.au
Reverse DNS: a23-51-195-181.deploy.static.akamaitechnologies.com

Akamai Technologies runs a content delivery network that many media companies use – their systems make websites faster to load by saving a copy of frequently viewed content to servers located closer to the end users.

As for the reason for the cascade of page redirects and the mysterious sslcam.news.com.au domain, I’m at a loss to explain it – sorry!

Footnote

The sslcam.news.com.au domain is also used by other News Limited websites – the Herald Sun also routes traffic to their website via it.
