Reporting In NoSQL
I was on the Herding Code podcast the other night and it was a ton of fun – I managed to hijack the conversation a number of times and make K. Scott Allen wonder why he even showed up. At one point he just flat out said
“… this podcast has left me… so confused…”
It was good fun :). One of the things we discussed was NoSQL and we ran over many of the same questions that I’ve been asked over the years with respect to using an object database. In fact, since then, I’ve received a number of emails asking me these very questions once again! Which is great! It means people are thinking it through and that’s all I could really ask for :).
Instead of carrying on 3 email threads and some thoughts on Twitter – I’ll “put it on the blog” so people can flame me :).
The Phantom Limb
Put as gently as I can –
relational systems are an answer to a problem that we faced 30 years ago. What you’re doing now is nothing other than compensating for a lack of imagination from the platform developers. Think about it – we code using Object Oriented approaches, we store those objects in a relational system.
Imagine, if you will, going to a bank with your paycheck. You deposit your money and the teller has to pull out a complex bit of machinery in order to put your money in the vault. You might just ask the teller:
“Excuse me, but WTF are you doing?”
The teller looks back at you and says…
“Huh? We store our currency in
Gold Dinar which allows us to correlate our records with Asian and Middle Eastern countries as well as run reports with Byzantine baselines.”
Banks and the Byzantine empire roughly correspond to 2010 and the technically pubescent 70s and 80s. You might think I’m reaching here – but I’m really not.
Storing an application’s data in a system built to run millions of transactions a second is complete… insanity
. Notice I didn’t say “transacting” – I said “storing”. More on this below.
What About Reporting?
This is, by far, the number one question I get asked. And my answer, which I’m sure is super aggravating is “what about it?”. Think of this in terms of programming – what are the needs here, what are the requirements?
Generally speaking, reporting is “the proving” of your application’s validity – its relevance to the business people who’ve hired you and the only measure of “was this worth it?”. You should actually be thinking of this first, before you code anything, as it will be the thing that gets you fired when the application you’ve created sucks and people hate it. OR it will be the thing that makes you a hero with a lifelong contract that comes with all the love in the world. It’s the proof of your skill – are people buying/using/paying attention?
The answer seems obvious – you need to store this information in some kind of system that allows you to
lie through your teeth and construct false reports that make you look awesome allow you to report accurately and ask various questions. A database, if you will.
So yes – you don’t want to be storing this kind of information in an Object or Document database – you want something that allows you to make associations on the fly, calculations on the fly – aggregations based on historical data that isn’t subject to change or corruption based on the need to always be online.
Once you star thinking this way you begin to realize that you’ve (most likely) been doin it wrong for a long, long time. Historical records (like an account balance) should almost never be deleted – they should be adjusted with a record that describes the adjustment and, more accurately, should lay out a trail of popcorn through the digital forest so you, your client, and your customers know precisely what happened.
You ever built a reporting system and have your client ask you:
“why doesn’t this report line up and reconcile with last month?”
I have, and it was embarrassing. It didn’t line up because the client changed some info in the database and it rippled through to this month’s report:
Oh, to see a more accurate report you need to run last month’s report as well – and all the previous months. And re-send them to everyone. Sorry about that – you know – what can I do … computers these days…
(run away and feel really dumb…)
I’m getting ahead of myself. Let’s regroup.
Relational Databases Are The Middle Children Of Tech History
Put simply – they solve a problem that no longer exists. Namely “where do I store this rudimentary data from my users?” It used to be, back in the 80s and 90s,
“just stick it in our [Monolithic Server] because we’ll need it there anyway – I’ll run some queries to fix it up. Here’s your table, it’s called XQW6T.”
We don’t have this issue anymore, yet we’re still stuck on the notion that “I need to store my data there because… that’s the way we’ve always done it.” Precisely. We used cut people open and “bleed out illnesses” but, well, we’ve kind of moved on. Although I hear that was quite the struggle back then.
Let’s cut to it – most people store their data in a relational system (SQL Server, for instance) and run reports of same system. If I may be so bold, this is (in no particular order):
. Most of the time you’re running these queries on a live system that really should be paying attention to the people you’re trying to sell to…
. You could, quite easily, code up a routine that might lock a table or worse, slow your system to a crawl. I’ve done both, felt like an idiot. Didn't get fired. Should have.
. All it takes is an over-zealous marketing-type to decide that “Fluffy Pink Bunnies” should now be called “Frontal Assault Dynamite Fur Marauders” and … oops… all sales reports will now reflect that customers bought the latter product even though it didn’t exist 4 years ago.
. Most of the time, when running these reports, you need to scrub them somehow – making sure they’re accurate. How many times have you been caught with your client calling you with some absurd sales number and you suddenly remember that the $12 billion order that you ran through, for laughs, wasn’t deleted like you thought it was. Yes you did – don’t lie.
. Relational systems are supposed to be transactional, not historical or analytical. There are better tools for that. Relational theory made sense years and years ago when we needed to worry about hard drives – not anymore.RDBMS don’t fit for holding your application’s data, and they don’t fit for reporting. They’re a solution for a problem that doesn’t exist anymore. Time to kick them to the curb.
It’s worth mentioning that what I’m describing here already exists – it’s called an OnLine Analytical Process system – OLAP – and it’s built to report off tables structured and pre-aggregated in a certain way. They exists for the sole purpose of historical reporting. People make a bundle off this stuff and its a deep, deep field. I could get into it but you’d leave… so I’ll just recommend you hit Wikipedia.
Separate Your Data Concerns
Here it is: store your reporting data in a queryable system, flattened for performance and proper record keeping. Index it for its purpose – REPORTING – nothing else. Store your application’s historical record in there with all appropriate auditing.
In short: report from a targeted, specialized reporting system (SQL Server works wonderfully) and let it do its job. Focus your needs on the task at hand – getting your domain right and making sure the application works to spec – shove the “effluent”, if you will, into a system that is not only built to handle it but pushes you to build in safeguards and proper practices for storing historical data.
Products, Categories, Tags – these things are current information and belong in application storage and I can’t think of a better solution than an Object/Document database – because that’s what they are!
But What If I Change Something on My Object?
The short answer is this: if you change a property type the same thing will happen that happens in your application already. The DB will try and accommodate it but if it can’t, it’s null. What would you do with an RDBMS? Same thing, no doubt.
If you change a property name, how will the DB know? The short answer is that it won’t in the same way the compiler won’t. This is just storage of objects – it’s very very simple and it requires a shift in thinking from you. If you change “MyObject.PuppyChunks” to “MyObject.PoopyChunks” you’ll have to tell the storage bits about it. This can usually be done using a tool the system provides – or you can make a Rake task.
Generally speaking, however,
your testing should get you to a point where you have your names figured out already yes? Before you need to store anything? So this should be a non-issue.
Free Your Mind
If you sit back with a nice Zinfandel and think, in the purist sense, about what it is your doing for your application you begin to see this very natural divide between what the application needs to run versus the stories it tells about its lifetime. The same is true for people.
You drive a car, eat breakfast lunch and dinner, breath air and drink water. These are important things, to be sure, but when someone asks you “hey how’s it going"?” – you don’t really tell stories about every breath you took, meal you ate, etc. You’d probably opt for scanning your memory of the relevant things – what that breath, food, and water allowed you to *
* with your time here on earth.
That’s what is retained in your memory – your analytical system if you will – not your current (or application) memory which is focused on “what should I eat next” or “I need to pee”.
The change is coming. It’s upon us and since I started out with a quote from K. Scott, I’ll end with one:
“It’s what we like to say: be the change you want to see”