So much data to process, so little time

With a relentless inbound flow of data, sooner or later you have to be a bit clever about how you query it.

5:30 on Thursday. Time for a pint. I’m at my regular haunt - technically DPS as the landlord can’t get here until around 7pm. But that’s another story.

I’ve written before about queries becoming slow as data sets balloon, and explored some options for avoiding reports grinding to a halt - or a gateway timeout.

This week I’ve been turning one of my theories into something more tangible and demonstrable. My DyadStats website has about 2.5 million page-request records. Simple queries are still pretty zippy, but those that group by text values can take 10 to 20 seconds to return data - not catastrophic but a bit tedious.

The first part of the main report is 30 days of visitor info. The simple way to speed this up has been to create a new data table that holds a row per day, with each row containing the aggregate data that the report displays.
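As a rough sketch of the idea (table and column names here are illustrative, not the actual DyadStats schema), the raw per-request rows get rolled up into one row per day, and the report reads the small table instead of scanning millions of request records:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw log: one row per page request (hypothetical schema).
    CREATE TABLE page_requests (
        requested_at TEXT,   -- ISO timestamp
        path TEXT
    );

    -- Aggregate table: one row per day, holding what the report displays.
    CREATE TABLE daily_stats (
        day TEXT PRIMARY KEY,
        request_count INTEGER
    );
""")

conn.executemany(
    "INSERT INTO page_requests VALUES (?, ?)",
    [
        ("2024-05-01T09:00:00", "/home"),
        ("2024-05-01T09:05:00", "/about"),
        ("2024-05-02T10:00:00", "/home"),
    ],
)

# Roll the raw rows up into one row per day; the 30-day report then
# touches at most 30 rows rather than the whole request log.
conn.execute("""
    INSERT INTO daily_stats (day, request_count)
    SELECT date(requested_at), COUNT(*)
    FROM page_requests
    GROUP BY date(requested_at)
""")

rows = conn.execute("SELECT * FROM daily_stats ORDER BY day").fetchall()
print(rows)  # [('2024-05-01', 2), ('2024-05-02', 1)]
```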

Similarly, the next report section displays totals on a per-month basis, and a new table has been created with a row per month.

After that, each table has been extended to include more columns, such as the number of distinct user agents, IP addresses and request paths / URLs.
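Those extra columns are just distinct counts computed during the same roll-up. A minimal sketch, again with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page_requests (
        requested_at TEXT,
        path TEXT,
        ip TEXT,
        user_agent TEXT
    )
""")
conn.executemany(
    "INSERT INTO page_requests VALUES (?, ?, ?, ?)",
    [
        ("2024-05-01T09:00:00", "/home", "1.1.1.1", "Firefox"),
        ("2024-05-01T09:05:00", "/about", "1.1.1.1", "Chrome"),
        ("2024-05-01T09:09:00", "/home", "2.2.2.2", "Chrome"),
    ],
)

# One aggregate row per day, now carrying distinct counts alongside the
# plain total -- exactly the kind of column the extended tables hold.
row = conn.execute("""
    SELECT date(requested_at) AS day,
           COUNT(*) AS requests,
           COUNT(DISTINCT user_agent) AS user_agents,
           COUNT(DISTINCT ip) AS ips,
           COUNT(DISTINCT path) AS paths
    FROM page_requests
    GROUP BY day
""").fetchone()
print(row)  # ('2024-05-01', 3, 2, 2, 2)
```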

After playing around with things and creating a new reporting engine that decides whether to use the stored aggregated data or refresh it, the metaphorical sailing seemed relatively plain. The next challenge was aggregating data with more than one row per day.
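The decide-or-refresh logic might look something like this. Everything here is an assumption about how such an engine could work, not the actual implementation: historical days are served straight from the aggregate store, while today's row (or anything flagged stale) is recomputed from the raw data.

```python
from datetime import date

# Hypothetical in-memory stand-ins for the aggregate table and a
# "needs refreshing" flag; the real thing would live in the database.
stats_cache: dict[date, int] = {}
stale_days: set[date] = set()

def requests_for(day: date, compute) -> int:
    """Return the stored aggregate for a day, refreshing only when needed.

    `compute` is whatever function runs the expensive query against the
    raw request log.
    """
    # A finished day's totals never change, so recompute only when the
    # row is missing, flagged stale, or still accumulating (today).
    if day not in stats_cache or day in stale_days or day == date.today():
        stats_cache[day] = compute(day)
        stale_days.discard(day)
    return stats_cache[day]
```

The payoff is that the expensive query runs once per historical day ever, instead of once per report view.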

For a particular day there might be 20 different user agents associated with page requests. Storing the number 20 for a given day is trivial, but what if we want to list them?

The answer at the moment is to create a new table specifically for user agents that stores rows containing the user agent and its count per day. There is an overhead with this approach in terms of creating potentially lots of metadata, but there are significant advantages. Firstly, we can bin it. The metadata is not the data. It’s a mechanism - an accelerator - and can be emptied and/or updated to suit. Secondly, it can act like a prism - more on that later. And equally importantly, it should produce reports as fast as static HTML.
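A sketch of that per-day user-agent table (names invented for illustration). Because it is derived metadata, the whole thing can be truncated and rebuilt from the raw log at any time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_requests (requested_at TEXT, user_agent TEXT);

    -- Derived metadata: one row per (day, user agent) with a count.
    -- Disposable by design: DELETE and re-run the roll-up to rebuild.
    CREATE TABLE daily_user_agents (
        day TEXT,
        user_agent TEXT,
        request_count INTEGER,
        PRIMARY KEY (day, user_agent)
    );
""")
conn.executemany(
    "INSERT INTO page_requests VALUES (?, ?)",
    [
        ("2024-05-01T09:00:00", "Firefox"),
        ("2024-05-01T09:05:00", "Chrome"),
        ("2024-05-01T09:09:00", "Chrome"),
    ],
)
conn.execute("""
    INSERT INTO daily_user_agents
    SELECT date(requested_at), user_agent, COUNT(*)
    FROM page_requests
    GROUP BY date(requested_at), user_agent
""")

# The report lists a day's user agents from the small metadata table,
# never touching the raw request log.
agents = conn.execute(
    "SELECT user_agent, request_count FROM daily_user_agents "
    "WHERE day = '2024-05-01' ORDER BY request_count DESC"
).fetchall()
print(agents)  # [('Chrome', 2), ('Firefox', 1)]
```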