So much data to process, so little time

With a relentless inbound flow of data, sooner or later you have to be a bit clever about how you query it.

5:30 on Thursday. Time for a pint. I’m at my regular haunt - technically DPS as the landlord can’t get here until around 7pm. But that’s another story.

I’ve written before about queries becoming slow as data sets balloon, and explored some options for avoiding reports grinding to a halt - or a gateway timeout.

This week I’ve been turning one of my theories into something more tangible and demonstrable. My DyadStats website has about 2.5 million page-request records. Simple queries are still pretty zippy, but those that group by text values can take 10 to 20 seconds to return data - not catastrophic but a bit tedious.

The first part of the main report is 30 days of visitor info. The simple way to speed this up has been to create a new data table that holds a row per day, with each row containing the aggregate data that the report displays.
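As a rough sketch of the idea (table and column names here are illustrative, not the actual DyadStats schema), the raw per-request rows get rolled up into one row per day, and the report reads the small table instead of scanning millions of request records:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw log: one row per page request (hypothetical schema).
    CREATE TABLE page_requests (
        requested_at TEXT,   -- ISO timestamp
        path TEXT
    );

    -- Aggregate table: one row per day, holding what the report displays.
    CREATE TABLE daily_stats (
        day TEXT PRIMARY KEY,
        request_count INTEGER
    );
""")

conn.executemany(
    "INSERT INTO page_requests VALUES (?, ?)",
    [
        ("2024-05-01T09:00:00", "/home"),
        ("2024-05-01T09:05:00", "/about"),
        ("2024-05-02T10:00:00", "/home"),
    ],
)

# Roll the raw rows up into one row per day; the 30-day report then
# touches at most 30 rows rather than the whole request log.
conn.execute("""
    INSERT INTO daily_stats (day, request_count)
    SELECT date(requested_at), COUNT(*)
    FROM page_requests
    GROUP BY date(requested_at)
""")

rows = conn.execute("SELECT * FROM daily_stats ORDER BY day").fetchall()
print(rows)  # [('2024-05-01', 2), ('2024-05-02', 1)]
```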

Similarly, the next report section displays totals on a per-month basis, and a new table has been created with a row per month.

After that, each table has been extended to include more columns, such as the number of distinct user agents, IP addresses and request paths / URLs.
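Those extra columns are just distinct counts computed during the same roll-up. A minimal sketch, again with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page_requests (
        requested_at TEXT,
        path TEXT,
        ip TEXT,
        user_agent TEXT
    )
""")
conn.executemany(
    "INSERT INTO page_requests VALUES (?, ?, ?, ?)",
    [
        ("2024-05-01T09:00:00", "/home", "1.1.1.1", "Firefox"),
        ("2024-05-01T09:05:00", "/about", "1.1.1.1", "Chrome"),
        ("2024-05-01T09:09:00", "/home", "2.2.2.2", "Chrome"),
    ],
)

# One aggregate row per day, now carrying distinct counts alongside the
# plain total -- exactly the kind of column the extended tables hold.
row = conn.execute("""
    SELECT date(requested_at) AS day,
           COUNT(*) AS requests,
           COUNT(DISTINCT user_agent) AS user_agents,
           COUNT(DISTINCT ip) AS ips,
           COUNT(DISTINCT path) AS paths
    FROM page_requests
    GROUP BY day
""").fetchone()
print(row)  # ('2024-05-01', 3, 2, 2, 2)
```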

After playing around with things and creating a new reporting engine that decides whether to use the stored aggregated data or refresh it, the metaphorical sailing seemed relatively plain. The next challenge was aggregating data with more than one row per day.
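The decide-or-refresh logic might look something like this. Everything here is an assumption about how such an engine could work, not the actual implementation: historical days are served straight from the aggregate store, while today's row (or anything flagged stale) is recomputed from the raw data.

```python
from datetime import date

# Hypothetical in-memory stand-ins for the aggregate table and a
# "needs refreshing" flag; the real thing would live in the database.
stats_cache: dict[date, int] = {}
stale_days: set[date] = set()

def requests_for(day: date, compute) -> int:
    """Return the stored aggregate for a day, refreshing only when needed.

    `compute` is whatever function runs the expensive query against the
    raw request log.
    """
    # A finished day's totals never change, so recompute only when the
    # row is missing, flagged stale, or still accumulating (today).
    if day not in stats_cache or day in stale_days or day == date.today():
        stats_cache[day] = compute(day)
        stale_days.discard(day)
    return stats_cache[day]
```

The payoff is that the expensive query runs once per historical day ever, instead of once per report view.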

For a particular day there might be 20 different user agents associated with page requests. Storing the number 20 for a given day is trivial, but what if we want to list them?

The answer at the moment is to create a new table specifically for user agents that stores rows containing the user agent and its count per day. There is an overhead with this approach in terms of creating potentially lots of metadata, but there are significant advantages. Firstly, we can bin it. The metadata is not the data. It’s a mechanism - an accelerator - and can be emptied and/or updated to suit. Secondly, it can act like a prism - more on that later. And equally importantly, it should produce reports as fast as static HTML.
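A sketch of that per-day user-agent table (names invented for illustration). Because it is derived metadata, the whole thing can be truncated and rebuilt from the raw log at any time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_requests (requested_at TEXT, user_agent TEXT);

    -- Derived metadata: one row per (day, user agent) with a count.
    -- Disposable by design: DELETE and re-run the roll-up to rebuild.
    CREATE TABLE daily_user_agents (
        day TEXT,
        user_agent TEXT,
        request_count INTEGER,
        PRIMARY KEY (day, user_agent)
    );
""")
conn.executemany(
    "INSERT INTO page_requests VALUES (?, ?)",
    [
        ("2024-05-01T09:00:00", "Firefox"),
        ("2024-05-01T09:05:00", "Chrome"),
        ("2024-05-01T09:09:00", "Chrome"),
    ],
)
conn.execute("""
    INSERT INTO daily_user_agents
    SELECT date(requested_at), user_agent, COUNT(*)
    FROM page_requests
    GROUP BY date(requested_at), user_agent
""")

# The report lists a day's user agents from the small metadata table,
# never touching the raw request log.
agents = conn.execute(
    "SELECT user_agent, request_count FROM daily_user_agents "
    "WHERE day = '2024-05-01' ORDER BY request_count DESC"
).fetchall()
print(agents)  # [('Chrome', 2), ('Firefox', 1)]
```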