Explaining Distributed Data Consistency to IT novices? Well, …

Greek Shepherd

it’s all Greek to me. Bruce Stidston cited a post on Google+ where Yonatan Zunger, Chief Architect of Google+, tried to explain Data Consistency by way of Greeks enacting laws onto statute books on disparate islands. Very long post here. It highlights the challenges of maintaining data consistency when pieces of your data are distributed across many locations, and the logistics of trying to keep them all in sync – in a way that should be understandable to the lay (albeit patient) reader.

The treatise missed out the concept of two-phase commit, which is a way of doing handshakes between two identical copies of a database to ensure a transaction gets applied successfully on both the master and the replica sited elsewhere on the network. So, if you get some sort of failure mid-transaction, both sides get returned to a consistent state, without anything falling down the cracks. Important if that data is, for example, monetary balance transfers between bank accounts.
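For the curious, the handshake is simple enough to sketch in a few lines of Python. This is a toy simulation of the idea only – not any real database’s API – and the class and method names are my own invention:

```python
class Replica:
    """A toy database copy that can stage, commit or discard a change."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.balance = 100
        self.staged = None
        self.fail_on_prepare = fail_on_prepare  # simulate a mid-transaction failure

    def prepare(self, delta):
        # Phase 1: stage the change and vote yes/no.
        if self.fail_on_prepare:
            return False
        self.staged = delta
        return True

    def commit(self):
        # Phase 2a: everyone voted yes, so make the staged change real.
        self.balance += self.staged
        self.staged = None

    def rollback(self):
        # Phase 2b: someone voted no, so discard the staged change.
        self.staged = None


def two_phase_commit(replicas, delta):
    """Apply delta to all replicas, or to none of them."""
    if all(r.prepare(delta) for r in replicas):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.rollback()
    return False


master, replica = Replica("master"), Replica("replica")
ok = two_phase_commit([master, replica], 50)       # both commit
offsite = Replica("offsite", fail_on_prepare=True)
failed = two_phase_commit([master, offsite], 25)   # rolled back everywhere
```

If any participant votes “no” in the prepare phase, nobody commits – which is precisely the property that stops money falling down the cracks.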

The thing that impressed me most – and which I’d largely taken for granted – is how MongoDB (the most popular Open Source NoSQL database in the world) can handle virtually all the use cases cited in the article out of the box, with no add-ons. You can specify “happy go lucky” (one copy), a majority, or all replicas consistent before a write is confirmed complete. And if the definitive “Tyrant” fails, the surviving instances automatically vote on which secondary copy becomes the new primary (and when the old one rejoins, the changes are journaled back to consistency). Those instances can be distributed across different locations on the internet.
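Those acknowledgement options boil down to a simple counting rule over the replica set. A conceptual sketch – this simulates the idea, it is not the pymongo API, and the names are mine:

```python
def acknowledge(replica_results, w):
    """Decide whether a write is 'complete' under a given write concern.

    replica_results: one boolean per replica, True if that copy applied
    the write. w: an integer (e.g. 1 for "happy go lucky"), "majority",
    or "all" -- loosely modelled on MongoDB's write concern options.
    """
    n = len(replica_results)
    succeeded = sum(replica_results)
    if w == "all":
        needed = n
    elif w == "majority":
        needed = n // 2 + 1
    else:
        needed = w
    return succeeded >= needed


relaxed = acknowledge([True, False, False], 1)        # one copy is enough
majority = acknowledge([True, True, False], "majority")
strict = acknowledge([True, True, False], "all")       # one straggler fails it
```

The trade-off is latency versus durability: the more copies you wait for, the slower the write confirmation, but the less you can lose if the primary dies.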

Bruce contended that Google may not like its blocking mechanics (which slow down access while data is written) to retain consistency on its own search database. However, I think Google’s workload will be very read-heavy, and it won’t usually be a disaster if changes are journaled into new search results before readers see them. No money to go down the cracks in their case; any changes just appear the next time you run the same search – one very big moving target.

Ensuring money doesn’t go down the cracks is what blockchains design out (a majority vote confirms each transaction, after which further attempts to change it are declined). That’s why it can take up to 10 minutes for a Bitcoin transaction to be verified. I wrote introductory pieces about Bitcoin and potential blockchain applications some time back, if those are of interest.
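The tamper-evidence comes from each block committing to the hash of its predecessor. A minimal sketch of that one idea – illustrative only, as real blockchains add proof-of-work, Merkle trees and much else:

```python
import hashlib
import json


def make_block(prev_hash, payload):
    """Build a block whose hash covers both its payload and its parent."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return {"prev": prev_hash, "payload": payload,
            "hash": hashlib.sha256(body.encode()).hexdigest()}


def valid_chain(chain):
    """Recompute every hash; any edit anywhere breaks the chain."""
    for i, block in enumerate(chain):
        body = json.dumps({"prev": block["prev"], "payload": block["payload"]},
                          sort_keys=True)
        if block["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        if i and block["prev"] != chain[i - 1]["hash"]:
            return False
    return True


genesis = make_block("0" * 64, "genesis")
block1 = make_block(genesis["hash"], "Alice pays Bob 5")
ok_before = valid_chain([genesis, block1])
block1["payload"] = "Alice pays Bob 500"   # tamper with history...
ok_after = valid_chain([genesis, block1])  # ...and the chain no longer validates
```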

So, I’m sure there must be a more pithy summary someone could draw, but it would add blockchains to the discussion, and probably relate some of the artistry behind hashes and Git/GitHub to managing large, multi-user, multi-location code, data and writing projects. However, that’s for the IT guys. They should know this stuff, and know what to apply in any given business context.
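As a taster of that artistry: Git names every piece of content by the SHA-1 of a small header plus the bytes themselves, so identical content always gets the same name on every machine – which is what makes distributed syncing and deduplication cheap. In Python:

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute a Git blob object ID: SHA-1 over 'blob <size>\\0<content>'."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The same object ID that `git hash-object` prints for this content.
oid = git_blob_hash(b"hello world\n")
```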

Footnote: I’ve related this to MongoDB as that is the one NoSQL database I have accreditations in, having completed two excellent online courses with them (while I’m typically a senior manager, I like to dip into new technologies to understand their capabilities – and to act as a bullshit repellent!). Details of said courses here. The same functionality may well be available with other NoSQL databases.

CloudKit – now that’s how to do a secure Database for users

Data Breach Hand Brick Wall Computer

One of the big controversies here relates to the appetite of the current UK government to release personal data with only the most basic understanding of what constitutes personally identifiable information. The lessons are there in history, but I fear that, without knowing the context of the infamous AOL Data Leak, we are destined to repeat it. With it goes personal information that we typically hold close to our chests, and whose release may cause personal, social or (in the final analysis) financial prejudice.

When plans were first announced to release NHS records to third parties, and in the absence of what I thought were appropriate controls, I sought (with a heavy heart) to opt out of sharing my medical history with any third party – and instructed my GP accordingly. I’d gladly share everything with satisfactory controls in place (medical research is really important and should be encouraged), but I felt that insufficient care was being exercised. That said, we’re more than happy for my wife’s Genome to be stored in the USA by 23andMe – a company that demonstrably satisfied our privacy concerns.

It therefore came as quite a shock to find that a report highlighting which third parties had already been granted access to health data, with Government-mandated approval, ran to a total of 459 data releases to 160 organisations (last time I looked, that was 47 pages of PDF). See this and the associated PDFs on that page. Given the level of controls, I felt this was outrageous. Likewise the plans to release HMRC-related personal financial data, again with soothing words from ministers who, given the NHS data experience, appear to have no empathy for the gross injustices likely to result from their actions.

The simple fact is that what constitutes individually identifiable information needs to be framed not only by which data fields are shared with a third party, but by the resulting application of that data by the processing party. Not least if there is any suggestion that the data is to be combined with other sources, which could in turn triangulate seemingly “anonymous” records back to a specific individual. Which is precisely what happened in the AOL Data Leak example cited.

With that, and on a somewhat unrelated, technical/programmer-orientated journey, I set out to learn how Apple had architected its new CloudKit API, announced this last week. This articulates the way in which applications running on your iPhone, iPad or Mac have a trusted way of accessing personal data stored (and synchronised between all of a user’s Apple devices) “in the Cloud”.

The central identifier that Apple associates with you, as a customer, is your Apple ID – typically an email address. In the Cloud, they give you access to two databases on their infrastructure: one public, the other private. However, the second you try to create or access a table in either, the API takes your iCloud identity and spits back a hash unique to the combination of your identity and the application asking to process that data. Different application, different hash. So even though everyone’s data sits in the same infrastructure, there is no common key on which disparate data could be triangulated back to uniquely identify a single user.
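The effect is easy to sketch. The derivation below – a salted hash over user identity plus app identity – is my illustration of the idea, not Apple’s actual scheme, and all the names are invented:

```python
import hashlib


def app_scoped_user_id(icloud_id: str, app_bundle_id: str,
                       salt: str = "server-side-secret") -> str:
    """Derive the identifier an app sees for a given user.

    Because the app's own identity is mixed into the hash, two different
    apps get two unrelated identifiers for the same person -- so their
    records cannot be joined on a common key.
    """
    material = f"{salt}:{icloud_id}:{app_bundle_id}"
    return hashlib.sha256(material.encode()).hexdigest()


notes_id = app_scoped_user_id("jane@example.com", "com.example.notes")
health_id = app_scoped_user_id("jane@example.com", "com.example.health")
# Same user, different apps: the two IDs share nothing usable for triangulation.
```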

Apple take this one stage further, in that any application that asks for any personally identifiable data (like an email address, age, postcode, etc.) from any table has to have access to that information specifically approved by the device’s end user. No explicit permission (on a per-application basis), no data.

The data maintained by Apple – personal information, health data (with HealthKit), details of the home automation kit in your house (with HomeKit), and not least the credit card details stored to buy Music, Books and Apps – makes full use of this security model. And they’ve dogfooded it, so that third-party application providers use exactly the same model, and the same back-end infrastructure. Which is also very, very inexpensive (data volumes go into Petabytes before you spend much money).

There are still some nuances I need to work out. I’m used to SQL databases and to some NoSQL database structures (I’m MongoDB certified), but it’s not clear, from the way the database behaves, which engine is being used behind the scenes. It appears to be a key:value store with some garbage collection mechanics that look like a hybrid file system. It can also store “subscriptions”, so that if specific criteria appear in the data store, specific messages are dispatched to the user’s devices over the network automatically. Hence things like new diary appointments in a calendar can be synced across a user’s iPhone, iPad and Mac transparently, without each device wasting battery power polling the large server-side database for events that are likely to arrive infrequently.
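That subscription mechanism is essentially the observer pattern run server-side: the store evaluates a saved record against each registered predicate and pushes to whoever matched, instead of every device polling. A sketch of the concept – the names are mine, not the CloudKit API:

```python
class RecordStore:
    """Toy record store that pushes matching records to subscribers."""

    def __init__(self):
        self.records = []
        self.subscriptions = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        # e.g. "notify me whenever an appointment record appears"
        self.subscriptions.append((predicate, callback))

    def save(self, record):
        self.records.append(record)
        # Evaluate each subscription once, at write time -- no polling needed.
        for predicate, callback in self.subscriptions:
            if predicate(record):
                callback(record)


store = RecordStore()
notified = []
store.subscribe(lambda r: r.get("type") == "appointment", notified.append)
store.save({"type": "note", "text": "shopping list"})       # no push
store.save({"type": "appointment", "title": "Dentist, 3pm"})  # pushed
```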

The final piece of the puzzle I’ve not worked out yet is, if you already have a large database (say of the calories, carbs, protein, fat and weights of thousands of foods in a nutrition database), how you’d get it loaded into an instance of the public database in Apple’s Cloud. Other than writing custom loading code, of course!

That apart, I’m really impressed by how Apple have designed the datastore to ensure the security of users’ personal data, and to rule out triangulation between information stored by different applications. If any personally identifiable data is requested by an application, the user of the device has to specifically authorise its disclosure, for that application only. And the app can’t even sense whether the data is present at all ahead of that release permission – so if, for example, a health app wants access to your blood sampling data, it can’t tell whether that data even exists before permission is given, and so can’t infer that you probably have diabetes from the mere fact that you record glucose readings at all.

In summary, an impressive design, and a model that deserves our total respect. The more difficult job will be to get the same mindset into the folks looking to release the most personal data that we shared privately with our public sector servants. They owe us nothing less.

Programming and my own sordid past

Austin Maestro LCP5

Someone asked me what sort of stuff I’ve programmed down the years. I don’t think I’ve ever documented it in one place, so I’m going to attempt a short summary here. I even saw the car above while it was still in R&D at British Leyland! There are lots of other smaller hacks, but this should give a flavour of the more sizable efforts. The end result is why I keep technically adept, even though most roles I have these days are more managerial in nature, where the main asset is being able to suss BS from a long distance.

Things like Excel, 1-2-3, Tableau Desktop Professional and latterly Google Fusion Tables are all IanW staples these days, but I’ve not counted these as real programming tools. Nor have I counted use of SQL to extract data from database tables directly in MySQL, or within Microsoft SQL Server Reporting Services (SSRS), which I’ve also picked up along the way. Ditto the JavaScript-based shell in front of MongoDB.

Outside of these, the projects have been as follows:

JOSS Language Interpreter (A-Level Project: PAL-III Assembler). This was my tutor’s University project: a simple language consisting of only 5 commands. I wrote the syntax checker and associated interpreter. It didn’t even have a “run” command; you just did a J 0 (Jump to Line Zero) to set it in motion.

Magic Square Solver (Focal-8). Managed to work out how to build a 4×4 magic square where every row, column, both diagonals and the centre four squares all added up to the same number. You could tap in any number and it would work out the entries for you and print the square out.

Paper Tape Spooler (Basic Plus on RSTS/E). My first job at Digital (as a trainee programmer) was running off the paper tape diagnostics my division shipped out with custom-built hardware options. At the time, paper tape was the universal data transfer medium for PDP-8 and PDP-11 computers. My code spooled multiple copies out, automatically restarting from the beginning of the current copy if the drive ran out of paper tape mid-way through. It also permitted the operator to input a message, which was printed in 8×7 dot letter shapes using the 8-hole punch at the front of each tape – so the field service engineer could readily see what was on the tape.

Wirewrap Optimiser (Fortran-11 on RSX-11M). At the time, my division of DEC was building custom circuit boards for customers to use in their PDP-8 and PDP-11 computers, and extensive use was made of wire-wrapped backplanes into which the boards plugged, alongside the associated OmniBus, UniBus or Q-Bus electronics. The Wirewrap program was adapted from a piece of public domain code to tell the operator (holding a wirewrap gun) which pins on a backplane to wire together, and in what sequence. This nominally minimised the number of connections needed, and made the end result as maintainable as possible (avoiding too many layers of wires to unpick if a mistake was made during the build).

Budgeting Suite (Basic Plus on RSTS/E). Before we knew of this thing called a Spreadsheet (it was a year after Visicalc had first appeared on the Apple ][), I coded up a budget model for my division of DEC in Basic Plus. It was used to model the business as it migrated from doing individual custom hardware and software projects into one where we looked to routinely resell what we’d engineered to other customers. Used extensively by the Divisional board director that year to produce his budget.

Diagnostics (Far too many to mention, predominantly Macro-11 with the occasional piece of PAL-III PDP-8 Assembler, standalone code or adapted to run under DEC-X/11). After two years of pushing bits to device registers, and ensuring other bits changed in sync, it became a bit routine and I needed to get out. I needed to talk to customers … which I did on my next assignment, and then escaped to Digital Bristol.

VT31 Light Pen Driver (Macro-11 on RSX-11M). The VT31 was a bit-mapped display and you could address every pixel on it individually. The guy who wrote the diagnostic code (Bob Grindley) managed to get it to draw circles using just increment and decrement instructions – no sign of any trig functions anywhere – which I thought was insanely neat. So neat, I got him to write it up on a flowchart which I still have in my files to this day. That apart, one of our OEM customers needed to fire actions off if someone pressed the pen button while the pen was pointing at a location on the screen. My RSX-11M driver responded to a $QIO request to feed back the button press event and the screen location being pointed at when it occurred, either directly or handled as an Asynchronous System Trap (AST in PDP-11 parlance). Did the job; I think it was used in some aerospace radar-related application.
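The trick is almost certainly a relative of the midpoint circle algorithm, which walks a circle’s outline using only additions and subtractions on an error term – no trig, no square roots. A Python rendering of that idea (my reconstruction of the technique, not Bob’s actual code):

```python
def circle_points(r):
    """Integer points on a circle of radius r, via the midpoint algorithm.

    Walks one octant with an incrementally updated decision variable,
    then mirrors each point into the other seven octants.
    """
    pts = set()
    x, y, d = r, 0, 1 - r
    while x >= y:
        # Mirror the octant point (x, y) into all eight octants.
        for sx, sy in ((x, y), (y, x)):
            pts.update({(sx, sy), (-sx, sy), (sx, -sy), (-sx, -sy)})
        y += 1
        if d < 0:
            d += 2 * y + 1          # stay on the same column
        else:
            x -= 1
            d += 2 * (y - x) + 1    # step inwards one column
    return pts


pts = circle_points(3)
```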

Kongsberg Plotter Driver (Pressed Steel Fisher, Macro-11 on RSX-11M). Pressed Steel Fisher were the division of British Leyland in Cowley, Oxford who pressed the steel plates that made Austin and Morris branded car bodies. The Kongsberg Plotter drew full-size stencils which were used to fabricate the car-size body panels; my code drove the pen on it from customers’ own code converted to run on a PDP-11. The main fascination personally was being walked through one workshop where a full-size body of an as-yet-unannounced car was sitting there, complete. Called at that stage the LCP5, it was released a year later under the name Austin Maestro – the mid-range big brother to the now largely forgotten Mini Metro.

Spanish Lottery Random Number Generator (De La Rue, Macro-11 on RSX-11M). De La Rue had a secure printing division that printed most of the cheque books used in the UK back in the 1980’s. They were contracted by the Spanish Lottery to provide a random number generator. I’m not sure if this was just for testing or used for the real McCoy, but I was asked to provide one nonetheless. I wrote all the API code and unashamedly lifted the well-tested random number generator itself from the sources of RT-11, the single-user, foreground/background-only operating system. It worked well, and the customer was happy with the result. I may have passed up the opportunity to become really wealthy by being so professional 🙂

VAX PC11 Paper Tape Driver (Racal Redac, Thorn EMI Wookey Hole, others, Macro-32 on VAX/VMS). Someone from Educational Services had written a driver for the old PC11 8-hole paper tape reader and punch as an example driver. Unfortunately, if it ran out of paper tape when outputting the blank header or trailer (you had to leave enough blank tape at either end to feed the reader properly), the whole system crashed. Something of an inconvenience if it was supposed to be doing work for hundreds of other users at the same time. I cleaned up the code, fixed the bug, and then added extra code to print a message on the header as I’d done earlier in my career. The result was used in several applications to drive printed circuit board, milling and other manufacturing machines which still used paper tape input at that stage.

Stealth Tester, VAX/VMS Space Invaders (British Aerospace, VAX Fortran on VAX/VMS). Not an official project, but one of our contacts at British Aerospace in Filton requested help fixing a number of bugs in his lunchtime project – to implement Space Invaders on VAX/VMS for any user on an attached VT100 terminal. The team (David Foddy, Bob Haycocks and Maurice Wilden) nearly got outed while poring over a listing when the branch manager (Peter Shelton) walked into the office unexpectedly, though he left seemingly impressed by his employees working so hard to fix a problem with VAX Fortran “for BAE”. Unfortunately, I was the weak link a few days later; the same manager walked into the computer room while I was testing the debugged version, before we’d added the code to exit quickly if the operator tapped Control-C on the keyboard. When he looked over my shoulder after seeing me frantically trying to abort something, he was greeted by the Space Invaders Superleague, complete with the pseudonyms of all the testers on board. Top of the list was Flash Gordon’s Granny (aka Maurice Wilden), with two entries belonging to Bob Haycocks (Gloria Stitz and Norma Snockers). Fortunately, he saw the funny side!

VMS TP Monitor Journal Restore (Birds Eye Walls, Macro-32 on VAX/VMS). We won an order to supply 17 VAX computers to Birds Eye Walls, nominally for their “Nixdorf Replacement Project”. The system was a TP Monitor that allowed hundreds of telesales agents to take orders for Birds Eye Frozen Peas, other Frozen goods and Walls Ice Cream from retailers – and play the results into their ERP system. I wrote the code that restored the databases from the database journal in the event of a system malfunction, hence minimising downtime.

VMS TP Monitor Test Suite (Birds Eye Walls, Macro-32 and VAX Cobol on VAX/VMS). Having done the database restore code, I was asked to write some test programs to do regression tests on the system as we developed the TP Monitor. Helped it all ship on time and within budget.

VMS Print Symbiont Job Logger (Birds Eye Walls, Macro-32 on VAX/VMS). One of the big scams on the previous system was the occasional double printing of a customer invoice, which doubled as a pick list for the frozen food delivery drivers. If such a thing happened, inadvertently or on purpose, it was important to spot the duplicate printing and ensure the delivery driver only received one copy (otherwise they’d receive two identical pick lists, take away goods and then be tempted to lose one invoice copy: free goods). I had to modify the VMS Print Symbiont (the system print spooler) to add code logging each invoice or pick list printed – for subsequent audit by other people’s code.

Tape Cracking Utilities (36 various situations, Macro-32 on VAX/VMS). After moving into Presales, the usual case was to be handed some Fortran, Cobol or other code on an 800 or 1600 bpi magnetic tape to port over and benchmark. I ended up being the district (3 offices) expert on reading all sorts of tapes from IBM, ICL and a myriad of other manufacturers’ systems. I built a suite of analysis tools to help work out the data structures on them, and then other Macro-32 code to read the data and put it in a format usable on VAX/VMS systems. The customer code was normally pretty easy to get running, and benchmarks timed after that. The usual party trick was then to put the source code through a tool called “PME”, which took the place of the source code debugger and sampled the PC (Program Counter) 50 times per second as the program ran. Once finished, an associated program output a graph showing where the user’s software was spending all its time; a quick tweak in a small subroutine amongst a mountain of code, and zap – the program ran even faster. PME was productised by its author, Bert Beander, later on, the code becoming what was then known as the VAX Performance and Coverage Analyzer (PCA).
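PME’s approach is easy to demonstrate in miniature: interrupt the running program on a timer and tally where it was each time. A POSIX-only Python sketch of the same idea (the real PME sampled the hardware program counter; here the “PC” is simply the name of the Python function that was executing):

```python
import collections
import signal

samples = collections.Counter()


def take_sample(signum, frame):
    # On each timer tick, record which function the program was in.
    samples[frame.f_code.co_name] += 1


def hot_loop():
    # A deliberately CPU-hungry routine for the profiler to catch.
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total


signal.signal(signal.SIGPROF, take_sample)
signal.setitimer(signal.ITIMER_PROF, 0.005, 0.005)  # fire every 5 ms of CPU time
hot_loop()
signal.setitimer(signal.ITIMER_PROF, 0, 0)          # stop sampling

# samples now shows where the CPU time went -- dominated by hot_loop.
```

The statistical trick is that a function consuming most of the CPU time will, with overwhelming probability, be the one interrupted most often – which is exactly the graph PME drew.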

Sales Out Reporting System (Datatrieve on VAX/VMS). When drafted in to look after our two industrial distributors, I wrote some code that consolidated all the weekly sales-out reporting for our terminals and systems businesses (from distributors down to the resellers that bought through each) and mapped the sales onto the direct account team looking after each end-user account that purchased the goods. They got credit for those sales as though they’d made them themselves, so they worked really effectively at opening the doors to the routine, high-volume but low-order-value fulfilment channels; the whole chain working together to maximise sales for the company. That allowed the End User Direct Account Teams to focus on the larger opportunities in their accounts.

Bakery Recipe Costing System (GW-Basic on MS-DOS). My father started his own bakery in Tetbury, Gloucestershire, selling up his house in Reading to buy a large 5-storey building (including shopfront) at 21, Long Street there. He then took out sizable loans to pay for an oven, associated craft bakery equipment and shop fittings. I managed to take a lot of the weight off his shoulders when he was originally seeing lots of spend before any likely income, by projecting all his cashflows in a spreadsheet. I then wrote a large GW-Basic application (the listing was longer than our combined living and dining room floors at the time) to maintain all his recipes, including ingredient costs. He then ran the business with a cash float of circa 6% of annual income. If it trended higher, he banked the excess; if it trended lower, he input the latest ingredient costs into the model, which recalculated the markups on all his finished goods to raise his shop prices. That code, running on a DEC Rainbow PC, lasted over 20 years – after which I recoded it in Excel.
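The heart of that application is small enough to sketch – in Python rather than GW-Basic, and with invented ingredients, prices and markup:

```python
# Ingredient prices per kg -- the inputs that change when suppliers reprice.
ingredient_prices = {"flour": 0.80, "butter": 4.50, "sugar": 1.20}

# Each recipe lists kg of each ingredient per batch.
recipes = {
    "shortbread": {"flour": 0.30, "butter": 0.20, "sugar": 0.10},
}


def batch_cost(recipe):
    """Raw ingredient cost of one batch at current prices."""
    return sum(ingredient_prices[name] * qty for name, qty in recipe.items())


def shop_price(recipe, markup=2.5):
    """Shop price: batch cost marked up to cover overheads and margin."""
    return round(batch_cost(recipe) * markup, 2)


cost = batch_cost(recipes["shortbread"])   # 0.24 + 0.90 + 0.12 = 1.26
price = shop_price(recipes["shortbread"])  # 1.26 * 2.5 = 3.15
# Raise ingredient_prices["butter"] and every product's price follows.
```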

CoeliacPantry e-Commerce Site (Yolanda Confectionery, predominantly PHP on Red Hat Linux 7.2). My wife and father’s business making bread and cakes for sufferers of Coeliac Disease (an intolerance of the gluten found in wheat products). I built the whole shebang from scratch, learning Linux from a book, then running it on a server in the Rackshack (later EV1Servers) datacentre in Texas, using Apache, MySQL and PHP. Bought Zend Studio to debug the code, and employed GPG to encrypt passwords and customer credit card details (the latter maintained off the server). Over 300 sales transactions, no chargebacks, until we had to close the business due to the ill-health of our baker.

Volume/Value Business Line Mapping (Computacenter, VBA for Excel, MS-Windows). My Volume Sales part of the UK Software Business was accountable for all sales of software products invoiced for amounts under £100,000, or where the order was for a Microsoft SELECT licence; one of my peers (and his team of Business Development Managers) focussed on Microsoft Enterprise Agreements and single orders of £100,000 or more. A simple piece of Visual Basic for Applications (VBA) code classified each software sale against these criteria and attributed it to the correct unit.
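That classification rule, sketched in Python rather than VBA. The field names are invented, and I’ve read a SELECT order as Volume regardless of value, which is how the split above reads to me:

```python
def business_unit(order):
    """Attribute an order to the Volume or Enterprise unit.

    Volume: under 100,000 GBP, or any Microsoft SELECT licence order.
    Enterprise: everything else (100,000 GBP or more, incl. Enterprise
    Agreements).
    """
    if order["licence"] == "Microsoft SELECT" or order["value_gbp"] < 100_000:
        return "Volume"
    return "Enterprise"


big_select = business_unit({"value_gbp": 250_000, "licence": "Microsoft SELECT"})
small_order = business_unit({"value_gbp": 40_000, "licence": "other"})
big_ea = business_unit({"value_gbp": 150_000, "licence": "Enterprise Agreement"})
```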

MongoDB Test Code (self-training: Python on OS X). I did the complete “MongoDB for Python Developers” course having never before used Python, but got to grips with it pretty quickly (it is a lovely language to learn). All my test code for the various exercises in the 6-week course was written in Python. For me, the main fascination was how MongoDB works by mapping its database files into the address space above its own code, so that the operating system’s own paging mechanism does all the heavy lifting. That’s exactly how we implemented Virtual Files in the TP Monitor for Birds Eye Walls back in 1981–2. With that, I’ve come full circle.
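The memory-mapping trick itself is easy to demonstrate with Python’s mmap module: you edit what looks like an in-memory byte array, and the operating system’s pager gets the change to and from disk on demand, with no explicit read/write calls:

```python
import mmap
import os
import tempfile

# Create a small file to stand in for a "database file".
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

# Map it into the process address space and edit it like a byte array.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mem:
        mem[0:5] = b"hello"   # looks like a slice assignment in memory...
        mem.flush()           # ...but the pager carries it to the file

# Read the file back through ordinary I/O to show the change landed.
with open(path, "rb") as f:
    first = f.read(5)
```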

Software Enabled (WordPress Network). My latest hack – the Ubuntu Linux server running Apache, MySQL, PHP and the WordPress Network that you are reading these words from right now. It’s hosted on Digital Ocean servers in Amsterdam, and is part of my learning exercise to implement systems using public cloud servers – and of my current work to simplify the engagement of AWS, Google Cloud Services and more in enterprise accounts, just like we did for DECdirect Software way back when. But that’s for another day.


“Big Data” is really (not so big) Data-based story telling

Aircraft Cockpit

I’m me. My key skill is splicing together data from disparate sources into a compelling, graphical and actionable story that prioritises the way(s) to improve a business. When can I start? Eh, Hello, is anyone there??

One characteristic of the IT industry is its penchant for picking snappy-sounding themes, usually illustrative of a future perceived need that customers may wish to aspire to – and to keep buying stuff toward that destination. Two of these terms de rigueur at the moment are “Big Data” and “Analytics”. They are attached to many (vendor) job adverts and (vendor) materials, though many are still searching for the first green shoots of demand from most commercial organisations – or at least taking a leap of faith that their technology will smooth the path to a future quantifiable outcome.

I’m sure there will be applications aplenty in the future. There are plenty of use cases where sensors will start dribbling out what becomes a tidal wave of raw information – on you personally, in your mobile handset, in low-energy Bluetooth beacons, and indeed plugged into the On-Board Diagnostics bus in your car – and aggregated up from there. Or, in the rarer case, where a company already has enough data locked down in one place to extract some useful insights, and the IT hardware to crack the nut.

I often see job adverts demanding “Hadoop”, but know of few companies who have the hardware to run it, let alone the Java software smarts to MapReduce a business problem effectively with it. If you press a vendor, you often end up with a use case of “Twitter sentiment analysis” (which, for most B2B and B2C companies, covers a small single-digit percentage of their customers), or of consolidating and analysing machine-generated log files (which is what Splunk does, out of the box).
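MapReduce itself is a simple shape, whatever the cluster plumbing around it. A toy word count in Python – map emits key/value pairs, a shuffle groups them by key, and reduce folds each group:

```python
from collections import defaultdict


def map_phase(doc):
    # Map: emit (word, 1) for every word in the document.
    for word in doc.lower().split():
        yield (word, 1)


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: fold each group of values into a single result per key.
    return {key: sum(values) for key, values in groups.items()}


docs = ["big data big hype", "big insight"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

On a cluster, the map and reduce calls run in parallel across machines and the shuffle moves data between them – but the programming model is no more than this.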

Historically, the real problem is data sitting in silos and the inability (for a largely non-IT-literate user) to do efficient cross-tabulations to eke a useful story out. Where they can, the normal result is locking onto a small number of priorities that make a fundamental difference to the business. Fortunately for me, that’s a thread that runs through a lot of the work I’ve done down the years – usually in an environment where all hell is breaking loose, everyone is working long hours, and high-priority CEO- or customer-initiated “fire drill” interruptions are legion. Excel, text, SQL Server, MySQL or MongoDB resident data: no problem here. A few samples, mostly done using Tableau Desktop Professional:

  1. Mixing a year’s worth of complex quotes data with a customer sales database. Finding that one sales region was consuming 60% of the team’s Cisco configuration resources, while at the same time selling 10% of the associated products. Digging deeper, finding that one customer was routinely asking our experts to configure their needs, but their purchasing department was buying all the products elsewhere. The account manager was duly equipped to have a discussion and initiate corrective action. Whichever way that went, we made more money and/or gained efficiency.
  2. Joining data from sales transactions and from Accounts Receivable query logs, producing daily updated graphs of Days Sales Outstanding (DSO) debt for each sales region – by customer, by vendor product, and by invoice in priority order. The target was to reduce DSO from over 60 days to 30; each internal sales manager had the data at their fingertips to prioritise their daily actions for maximum reduction – and to see key potential icebergs floating towards due dates. Along the way, we also identified one customer who had instituted a policy of querying every single invoice, raising our cost to serve and extending DSO artificially. Again, the account manager was equipped to address this.
  3. I was given the Microsoft business to manage at Metrologie, where we were transacting £1 million per month, not growing, with 60% of the business through one retail customer, and overall margins of 1%. There are two key things you do in a price war (as learnt when I’d done John Winkler’s pricing strategy training back in 1992), both needing a quick run around per-customer and per-product analyses. Having instituted staff licensing training, we made the appropriate adjustments to our go-to-market based on the Winkler work. Within four months, we were trading at £5 million per month and had doubled gross margins, without any growth from that largest customer.
  4. In several instances, demonstrating 7/8-figure software revenue and profit growth by using a model to identify the key challenges (or reasons for exceptional performance) in the business. Every product and subscription business has four key components that, mapped over time, expose what is working and where corrections are needed. You then have the tools to ask the right questions, assign the right priorities and ensure the business delivers its objectives. This has worked from my time in DECdirect (0–$100m in 18 months), through Computacenter’s Software Business Unit’s growth from £80m to £250m in 3 years, to being asked to manage a team of 4 working with products from 1,072 different vendors (and delivering our profit goals consistently every quarter). In the latter case, our share with the largest of those 1,072 vendors went from 7% of the UK market to 21% in 2 years, winning their Worldwide Solution Provider of the Year Award.
  5. Correlating subscription data at Demon against the list of people we’d sent Internet trial CDs to, per advertisement. Having found that the inbound phone team were randomly picking the first “this is where I saw the advert” choice on their logging system, we started using a different 0800 number for each advert placement, and took the readings off the switch instead. Given that, we could track customer acquisition cost per publication and spot trends; one was that ads in “The Sun” gave nominally low up-front acquisition costs per customer, but very high churn within 3 months. By regularly looking at this data – and feeding the results to our external media buyers weekly to help their price negotiations – we managed to keep per-retained-customer landing costs at £30 each, versus £180 for our main competitor at the time.
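The first example above gives a feel for the mechanics: join two silos on a common field and compare shares. A pure-Python sketch, with invented data standing in for the quotes and sales extracts:

```python
from collections import defaultdict

# Two silos: configuration effort per region, and sales per region.
# Figures are invented for illustration.
quotes = [{"region": "North", "hours": 60}, {"region": "South", "hours": 40}]
sales = [{"region": "North", "value": 10}, {"region": "South", "value": 90}]


def share_by_region(rows, field):
    """Cross-tab: each region's percentage share of the given field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row[field]
    grand = sum(totals.values())
    return {region: round(100 * t / grand) for region, t in totals.items()}


effort = share_by_region(quotes, "hours")   # North consumes 60% of the effort
revenue = share_by_region(sales, "value")   # ...but yields only 10% of sales
```

Putting the two shares side by side is the whole story: the mismatch between effort and revenue is what points the account manager at the right conversation.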

I have many other examples. Mostly simple, and not in the same league as the Hans Rosling or Edward Tufte examples I’ve seen. That said, the analysis and graphing was largely done out of hours, during days filled with more customer-focussed and internal management actions – ensuring our customer experience was as simple and consistent as possible, that the personal aspirations of the team members were fulfilled, and that we delivered all our revenue and profit objectives. I’m good at that stuff, too (ask any previous employer or employee).

With that, I’m off to write some Python code to extract some data ahead of my Google “Making Sense of Data” course next week – extending my 5 years of Tableau Desktop experience with some excellent-looking Google-hosted tools. And to agonise over how to reach someone who’ll employ me to help them, without HR dissing my chances of interview airtime for my lack of practical Hadoop or MapR experience.

The related Business and People Management smarts don’t appear on most “Requirements” sheets. Yet. A savvy manager is all I need airtime with…

The “M” in MOOC shouldn’t stand for “Maddening”

Mad man pulling his hair out in Frustration

There was a post in ReadWrite yesterday entitled “I failed my online course – but learned a lot about Education”: full story here. The short version: on her Massive Open Online Course, the instructor had delegated the marking of essays to fellow students on the course, four out of five of whom had unjustifiably marked one of her essays below the pass mark. With that, the chance of completing the course successfully evaporated, and she left it.

Talking to companies that run these courses for over a thousand (sometimes over 100,000) participants, she cites a statistic that only 6.8% of those registering make it through to the end of the course. From my own personal exposure to these things, success comes down to a number of factors:

  1. If the course is inexpensive or free, there will be a significant drop between the number of registrants and the number who even start the first lesson. Charges (or the availability of an otherwise unobtainable useful skill) will dictate its position in each person’s time priorities.
  2. The course must go through a worked example of a task before expecting participants to have the skills to complete a test.
  3. Subjective or ambiguous answers demotivate people and should be avoided at all costs. Further, course professors or assistants should be active on the associated forums to ensure students aren’t frustrated by omissions in the course material. That keeps students engaged and gives you pointers on how to improve the course next time it’s run.
  4. Above all, participants need to have a sense that they are learning something which they can later apply, and any tests that prove that do add weight to their willingness to plough on.
  5. The final test should be meaty, aspirational (at least when viewed from the start of the course) and prove that the certificate at the end is a worthwhile accomplishment – one to be personally proud of, and for your peers to respect.

I did two courses on MongoDB a year ago, one “MongoDB for Python Programmers”, the other “MongoDB for DBAs” (that’s Database Administrators for those not familiar with the acronym). Their churn waterfall looked much less dramatic than the 6.8% completion rate reported in the post; the courses I participated in started with 6,600 and 6,400 registrants respectively, and appear to achieve completion rates in the range of 19-24%, both then and in subsequent runs. Hence a lot of people out there with the skills to evangelise and use their software.

The only time any of the above hit me was in Week 2 of the Programmers course, whose prerequisites said you didn’t need Python experience to complete the course – it being easy to learn. In the event, we were asked to write a Python program from scratch to perform some queries on a provided dataset – before any code showing interaction with a MongoDB database had been demonstrated.

Besides building loop constructs in Python, the biggest gap was how Python variables mapped onto field names within MongoDB. After several frustrating hours, I put an appeal on the course forum for just one small example showing how the two interacted – and duly received one the next morning. Armed with that, I wrote my code, found it came out with one of the multiple choice answers, and all was done.
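For anyone hitting the same wall: the concept that eventually clicked for me is that in PyMongo, both documents and queries are plain Python dicts – MongoDB field names are simply dict keys. The sketch below mimics a `find()` with a `$gte` comparison operator on sample data (names and scores invented for illustration), so no live database is needed to see the mapping:

```python
# Toy illustration of how MongoDB field names map onto Python: documents are
# dicts, and a query is *also* a dict. This mimics collection.find() with a
# {"field": {"$gte": n}} filter, on invented sample data, with no server needed.

students = [
    {"name": "Alice", "score": 91},
    {"name": "Bob",   "score": 58},
    {"name": "Carol", "score": 77},
]

def find(collection, query):
    """A toy version of collection.find(query): exact match, plus a $gte operator."""
    results = []
    for doc in collection:
        match = True
        for field, cond in query.items():
            if isinstance(cond, dict) and "$gte" in cond:
                match = doc.get(field, float("-inf")) >= cond["$gte"]
            else:
                match = doc.get(field) == cond
            if not match:
                break
        if match:
            results.append(doc)
    return results

# With real PyMongo this would be: db.students.find({"score": {"$gte": 65}})
passing = find(students, {"score": {"$gte": 65}})
print([d["name"] for d in passing])
```

The same dict shape drives the real `db.students.find({"score": {"$gte": 65}})` call in PyMongo – which is the one-small-example insight I had appealed for on the forum.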

I ended up getting 100% passes with distinction in both courses, and could routinely show a database built, sharded and replicated across several running instances on my Mac – the very sort of thing you’d have to provide in a work setting – having had zero experience of NoSQL databases when the courses started 7 weeks earlier. If you are interested in how they set their courses up, there’s plenty of meat to chew at their Education Blog.

MongoDB for Developers Course Certificate MongoDB for DBAs Course Certificate

I did register for a Mobile Web Engineering course with iversity, but gave that up 2 weeks in. This was the first course I’d attended where fellow students marked my work (and I theirs – I had to mark 7 other students’ work each week). The downfall was vague exercise questions not covered in the course materials, with the nuances only outlined in lectures given in German. Having found fellow students almost universally confused, no explanation from the course professors or assistants in response to our cries for guidance, and everyone apparently spending inordinate, frustrating hours trying to reverse engineer what the requested answers should look like, I started thinking: what had I learnt so far?

Answer: how to deploy a virtual machine on my Mac. How to get German-language Firefox running in English. What a basic HTML5/CSS3 mobile template looks like. And that I’d spent 6 or so hours getting frustrated trying to reverse engineer the JavaScript calls from a German-language courseware authoring system, without any idea of what level of detail from the function calling hierarchy was needed for a correct answer in our test. In summary, a lot of work that reading the first few pages of a book could have covered. With that, I completed my assignment that week as best I could, marked the 7 other students as per my commitments, and, once done, deregistered from the course. I’ve bought some O’Reilly books instead to cover mobile app development, so am sure I’ll have a body of expertise to build from soon.

Next week I will be starting the Google “Making Sense of Data” course, which looks very impressive and should improve some of my analytics and display skills. Really looking forward to it. Given the right content, well engineered like the MongoDB courses, I’m sure Massive Open Online Courses will continue to enhance the skills of people, like me, who are keen to keep learning.

Stand back! I’ve done the Free Online Course


In California, Google run a fleet of driverless Toyota Prius, Lexus RX450h and Audi TT cars. The law requires a driver behind the wheel “just in case”, but by April 2013 they had done over 435,000 self-driven miles without a single accident – doing some impressive things along the way (if you have a spare 3 minutes, I’d encourage you to watch this video).

Peter Norvig and Sebastian Thrun are two Stanford professors heavily involved in that project at Google. In 2011, alongside their work at Google, they opened their Stanford University “Introduction to Artificial Intelligence” course for free – online – to anyone in the world who wanted to complete it. Over 100,000 people subscribed, and with it the Massive Open Online Course (MOOC) industry started.

MongoDB used the same structure of online teaching to offer two free courses last year, training people to program and administer databases using their market-leading MongoDB NoSQL database. Armed only with my MacBook Air on my dining table, I joined over 6,600 other hopefuls on their free, 7-week, 10-hours/week M101P: MongoDB for Developers course, which included examples in Python that we learned as part of the syllabus. I also joined over 6,400 other students doing the equivalent M102: MongoDB for DBAs course.

You learn from short videos, with frequent knowledge-test quizzes each week – up to 10 hours per week per course, taken in start/stop gaps around your other work and personal commitments. You then have a set of homework exercises to run on your own machine, with answers to be posted on their portal within a week of issue. New videos are released every Tuesday morning at 4am UK time, with the matching homework due within a week. At the very end, in Week 7, you have a summary and 10 or 11 final exam questions to answer that week.

There is plenty of help on hand from the instructors and a small number of teaching assistants on each course’s forum, though many of the queries are answered by fellow students.

Some weeks, it was mad. There I was at my dining room table, five weeks in, with a database split over three replica sets and multiple shards, all running very impressively on my MacBook Air. I just sat there shaking my head at the sheer complexity of the thing I’d built running in front of me – this from having no experience of MongoDB, JSON syntax, Python or JavaScript at all when I started the course.
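One of the ideas that made those replica sets feel magic is automatic failover: if the primary dies, the surviving members elect a new primary from the most up-to-date secondaries, with no operator involved. The sketch below is a deliberately simplified toy model of that idea in plain Python – it is not MongoDB’s actual election protocol, and the hosts and oplog timestamps are invented for illustration:

```python
# Toy sketch of the idea behind replica-set failover: when the primary dies,
# the surviving members promote the secondary with the most up-to-date oplog.
# A simplification invented for illustration -- not MongoDB's real protocol.

replica_set = [
    {"host": "localhost:27017", "state": "PRIMARY",   "oplog_ts": 1042, "alive": True},
    {"host": "localhost:27018", "state": "SECONDARY", "oplog_ts": 1042, "alive": True},
    {"host": "localhost:27019", "state": "SECONDARY", "oplog_ts": 1038, "alive": True},
]

def elect_new_primary(members):
    """Simulate the primary failing, then promote the freshest surviving member."""
    for m in members:
        if m["state"] == "PRIMARY":
            m["alive"] = False          # simulate the primary going down
    survivors = [m for m in members if m["alive"]]
    new_primary = max(survivors, key=lambda m: m["oplog_ts"])
    new_primary["state"] = "PRIMARY"
    return new_primary["host"]

print(elect_new_primary(replica_set))   # the most up-to-date secondary takes over
```

In the real product the surviving members vote, and a rejoining ex-primary has any divergent changes reconciled from the journal – the behaviour I could watch happening live on my dining room table.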

I was delighted to finish both courses with 100% ratings — something achieved by 2.2% of the intake on the programming course, and 5.1% on the DBA one. After 7 weeks, the company had an extra 9,000 or so professional advocates who had passed its exams since the courses first ran the previous year (this was only their second outing). I was duly certified:

MongoDB for Developers Course Certificate MongoDB for DBAs Course Certificate

The product itself is very, very impressive, built to scale out as your needs grow. I was no less impressed with the execution of this training on the edX platform, as described eloquently by VP Education Andrew Erlichson on their blog at the time. Anyone looking to do the same courses I did (and now more) can find them at https://education.mongodb.com/

This year, I’ve decided to improve my ability to sift data from databases and present it in compelling ways. I dislike tables of numbers – even throwing my wife’s heart monitor readings onto a Google Sheets graph for her doctor, not least to show that her anxiety at having readings taken calmed back to normal towards the end of the sampling period, something not immediately apparent from the raw data:

Jane Blood Pressure Chart
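The trick behind that chart is nothing more exotic than smoothing: noisy readings jump around, but a simple moving average makes the calming trend visible. The readings below are invented for illustration – they are not my wife’s actual data – but the technique is the one the graph relies on:

```python
# Sketch: smoothing noisy readings with a simple moving average so a calming
# trend becomes visible. The readings are invented for illustration only.

readings = [148, 152, 139, 145, 131, 128, 133, 122, 118, 121, 115, 112]

def moving_average(values, window=3):
    """Average each run of `window` consecutive readings."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

smoothed = moving_average(readings)
# The smoothed series trends steadily downwards, even though individual raw
# readings bounce up and down along the way.
print(smoothed[0], smoothed[-1])
```

Google Sheets and Tableau both do this for you with a trend line, of course – but seeing it as three lines of Python demystifies what the graph is actually showing.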

I’ve been doing similar but business-orientated things in Tableau Desktop Professional for over 5 years (exposing underlying trends, sometimes leading to spectacular business results), but I’ve no doubt I’ll learn new and useful techniques, with a fresh perspective, from Google and the tools they use. To this end, I’ve registered on their free “Making Sense of Data” online course and am ready to go (part time!) from March 18th until April 4th.

There are plenty of other courses available on a wide range of topics, most free, some with nominal subscription charges. Go have a gander at what’s available at:

So, lots to pique anyone’s interest, and to keep learning. Which courses are you going to do this year?