Introduction and Recap
We have recently been extracting a series of articles from a white-paper we wrote on developing a data-exploration application for a general audience, without hobbling users who have more sophisticated needs (even when those needs only surface after initial development).
In the previous article we covered the basic options: your first ports-of-call, offering the biggest bang for buck and the simplest implementation. Occasionally, though, these will not be enough, so in this instalment we will examine some more advanced options.
As always, the option you go with is decided by weighing up requirements, cost considerations, security issues, and so on.
Remote Console Access
On some rare occasions it is not enough for the user to query the data from their desktop; they need more direct access, on the server itself. This is typically the case where the data sets are very large even after querying, and the requirements include custom processing instead of just querying.
If this is the case, it might be most beneficial to give the user direct access to the server and its resources. This can be done via terminal software such as SSH or remote desktop (specialist technologies like VirtualGL mean that even sophisticated interactive visualisations can feasibly be delivered this way), or hybrid solutions such as IPython Notebook. The latter is a browser-based console that provides the full power of programming languages such as Python (though despite the name it is not limited to Python: a recent collaborative project involving its developers, Google and others is focused on extending it to many languages commonly used in analysis). The notebook is particularly popular with scientific users, as it is a good way to ensure reproducible research. It is also easy to provide sample code in a notebook, to get people started quickly.
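As a sketch of the kind of starter cell one might ship in such a notebook so analysts can begin immediately (the file name and column name here are hypothetical):

```python
# Starter snippet for a shared analysis notebook: count records per
# category in a CSV export. The file path and the "category" column
# are placeholders -- substitute your own data set's names.
import csv
from collections import Counter

def category_counts(path):
    """Return a Counter of rows per 'category' column in a CSV file."""
    with open(path, newline="") as f:
        return Counter(row["category"] for row in csv.DictReader(f))
```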
Security is obviously paramount, as the user now has direct access to the server itself! A combination of careful permissions, and modern technologies such as containers and virtualisation is usually required.
Preconfigured Cloud VMs and Data Stores
Cloud services provide an ideal mechanism for storing data and configurations for reuse and analysis. Templates for virtual machines are initialised with software (and possibly data, most likely on a common network drive), and when a user needs a particular suite of tools they simply spin up an appropriate instance. This approach is gaining popularity in the research sector, where tools are often specialised (not to mention, frequently tricky to install). If the data set is quite large, or needs specialised tools to access it (or indeed if your application already runs in the cloud, on a service such as AWS), it might be worth looking into an option like this. Depending on the software installed and the means of accessing it, there is plenty of similarity with remote console access (above).
Container technology is an interesting wrinkle on this, making it very easy to share configurations with other users in a lightweight fashion.
This has a very different cost structure to the typical “sunk cost” or “monthly rental” costs we usually think about with dedicated server resources. It’s often cheap but you pay by the compute hour for analysis, plus a typically modest fee for data transfer and storage. For sporadic but resource-intensive jobs in particular that can make a cloud-based on-demand option quite attractive.
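To make that concrete, here is a back-of-envelope break-even calculation; all of the prices are hypothetical placeholders, so substitute your provider's actual rates:

```python
# Compare a dedicated server's monthly rental against cloud on-demand
# pricing. Below the break-even number of compute hours per month,
# the on-demand option is cheaper.
def breakeven_hours(monthly_rental, hourly_rate, fixed_monthly_fees=0.0):
    """Compute hours per month at which on-demand cost equals a rental."""
    return (monthly_rental - fixed_monthly_fees) / hourly_rate

# e.g. a $400/month server vs $0.50/hour instances plus $20/month in
# storage and transfer fees: under 760 hours of compute per month,
# on-demand comes out ahead.
```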
Depending on the sensitivity of the data and any legislation which applies, hosting data in the cloud may or may not be appropriate. It is certain, though, that this is going to be a growing area, and that business attitudes and practices will evolve to embrace cloud-based services.
While not exhaustive, these last two articles give you a good overview of the options available. The important thing to remember is that any solution involves trade-offs. Sometimes the right combination is obvious, but occasionally you need to weigh various considerations more carefully. In future instalments we will examine some of these considerations in general, as well as real-life case-studies.
If you would like to talk with us further about any of these issues, or for a copy of the white-paper, please drop us a line any time.
I’ve been reading some trade articles about Business Intelligence and how businesses are benefitting from it. The articles in question were targeting large institutions such as banks, but they raise issues which apply more broadly, especially as the market evolves and tools become more accessible.
Firstly let’s consider a couple of key terms. If all of the information stored in your various business systems flowed back to a single location you would call it a Data Warehouse. And if you had tools which allowed you to query and report on that data you would have a Business Intelligence solution.
You can consider these as two halves of the same equation: what you pull into your data warehouse you can take out via business intelligence tools.
So what does business life look like without this in place? Actually it probably looks like most businesses today, with data stored in various different applications and databases.
You might, for example, have different systems for sales, customer support, accounting and managing your business processes. It would be typical to have a few indispensable Excel spreadsheets about too.
You probably manage this business with your sleeves rolled up. To make this all work you come to accept a certain number of time-consuming manual processes to copy data between systems, generate reports, and keep everything in sync. Psychologically, you’ve probably given up asking hard questions of your data too, because it always ends up being a mess of data management and ultimately produces results you just don’t trust.
Staff within individual business units are happy with their tools but that doesn’t help when it comes to getting a full view of your operations.
You probably got to this point for very good reasons. Over time you chose best of breed tools for specific applications and when things didn’t quite work you adapted and accepted that some manual processing was required.
The problem is that the left hand doesn’t know what the right hand is doing. Data is disparate and trapped in different places. As a result you don’t have an easy way to query it for reporting.
Business intelligence promises to resolve these issues by establishing processes which gather data from these operational systems and collect it up in a central location, a data warehouse, so that from then on you have a consistent and well-organised set of data about your whole organisation, ready for reporting via business intelligence tools.
Once established it’s going to be much quicker to answer business questions based on current information. This means more people in the business will be able to interrogate this data, and as a result you can encourage a culture of data-driven decision-making.
The labour equation changes too, as there will be less time spent gathering and processing data and more spent analysing and reporting on it.
As a business leader you’ll be able to see how your business is performing by establishing dashboards which track performance indicators and status reports which stay fresh too.
Please get in touch if you’d like to discuss further. This article hasn’t covered how you might implement a business intelligence solution, nor have we talked about some of the key opportunities it can unlock. We might save these for another day.
A few weeks ago we started looking at the frequently-encountered balancing act around implementing a data-driven application that:
We started by diving into the circumstances behind this situation; in this installment we’ll begin examining some concrete options for addressing it. If you would like a copy of the full whitepaper on the topic, please drop us a line.
The options that we’ll cover in this post should be the first ones considered, due to the completeness of access, and (relative) simplicity of implementation. Naturally, the full decision needs to also consider the users’ exact needs and technical sophistication, security and privacy considerations, and so on.
Read-only Database Access
In the most straightforward scenario, we simply provide select users with direct access to the underlying database.
This approach is best-suited to applications that explore a relatively static data set, and which don’t have specialised computational requirements. Happily, these constraints still describe the vast majority of applications today.
Security can be addressed by granting only read-only access, and if necessary sensitive data can be protected by the use of views and careful permissions.
It does assume that the user is willing and able to understand the schema, and of course knows how to construct database queries and so-on. It’s common for database administrators to reduce the complexity for analysts by preparing simplified views of the data for use in their queries.
Many tools are available for querying the data—even office tools such as Microsoft Excel—but it does require a certain amount of sophistication from the user.
A related approach is instead to provide access to an API.1 The API can offer simple low-level data access, in a similar fashion to that described above, but it also has the flexibility to expose computational access and other higher-level actions. For example, instead of constructing SQL that joins over several tables to produce a list of customers and recent purchases, the API might provide an all-customers method which does the same thing with more convenience. Similarly, instead of retrieving this list and plotting customer addresses on a map, the API might even offer a method that generates the map directly.
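A minimal sketch of that convenience-method idea in Python; the data layout and method name here are hypothetical, standing in for whatever a real API would expose:

```python
# Hypothetical API wrapper: all_customers() performs the equivalent of
# a SQL join over customers and purchases so users don't have to.
class CustomerAPI:
    def __init__(self, customers, purchases):
        self.customers = customers    # {customer_id: name}
        self.purchases = purchases    # list of (customer_id, item)

    def all_customers(self):
        """Each customer with their recent purchases attached."""
        return [
            {"id": cid, "name": name,
             "purchases": [item for pid, item in self.purchases if pid == cid]}
            for cid, name in self.customers.items()
        ]
```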
Security and other concerns are now much easier to address, since there is an extra layer between the user and the resources. On the flip side, designing an API that does not unduly restrict users can be more difficult.
There is a subtle benefit to this approach that is easy to overlook: it can be quite cost-efficient. This occurs when the same API that is given to the users is also used to build the main application in the first place. In this tactic much of the application’s logic and presentation is implemented directly in the browser, which makes calls to the API to retrieve data and invoke actions as necessary. This approach to application development has a number of advantages, not least of which is that it makes it much easier to address multiple platforms (browser, iPhone, Android, …) since only the interface needs to be ported.
When users need further access, they are simply given the documentation for the API, and no further work needs to be done.
The main disadvantage is it requires the user to be proficient with some programming language to access the API, but if they are already using R (for example) to perform statistical analysis they should find that sample code is enough to get started.
There’s also the risk that poorly designed scripts will place an unreasonable load on the server—this is particularly a problem where the API undertakes resource-intensive computations for users. There are various techniques which address these issues ranging from quotas and thresholds, isolating API services to dedicated resources, and careful design of API processes.
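One common form those quotas and thresholds take is a per-user token bucket, sketched minimally below (capacity and refill rate are illustrative tuning parameters): each request spends a token, and tokens refill at a fixed rate, which caps sustained load while still permitting short bursts.

```python
import time

class TokenBucket:
    """Per-user request limiter: allows bursts up to `capacity`,
    sustained throughput up to `refill_per_sec` requests/second."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        # Top up tokens earned since the last call, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```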
We have examined the simplest options available, which should also be the default approaches. While they are simple to implement (database access can be granted after development without being planned for, and modern web applications often implement an API as a matter of course, without any thought to other uses for it), they also offer the most flexibility for users. Of course, other issues such as security and the technical sophistication of the users should always be considered in tandem.
In our next post in this series we will investigate a few more advanced options for specialist situations. As always, if you would like to talk further about these or any other questions, feel free to get in touch.
Application Programming Interface; a specification for how computers can communicate at a relatively high level. These days it often also implies a network communication, such as from a web browser or smartphone app to a server. ↩
There has been so much focus in the mainstream media and business publications on “big data”, “why you should have a big data strategy”, and so on, that it can be easy to lose sight of what data should actually be used for. This article is a very brief summary of the main categories of queries your business should be making, and how.
A related perspective that is helpful to consider in this context is the business value associated with the time range of a query. From the lowest to highest value, we have the current status (search, operational reporting), historical data (historical reporting, analysis, summaries), and future prospects (predictive analytics, decision strategies). These also have a corresponding increase in implementation complexity.
What happened?
This tends to be the simplest question to answer. People have been generating reports and querying databases since the 1970s.
Your business is collecting data through a range of operational and management systems. All systems, big and small, provide a range of common reports and access to underlying data in some form or another.
Data warehousing tools help to pre-process this operational data to allow efficient reporting. Business intelligence solutions help to build and share reports, queries and dashboards.
Why did it happen?
Working out “why” tends to require a more exploratory approach. It requires analytical tools suited to ad-hoc queries which let you drill-down and explore data interactively.
“Why” leads to a search for hidden meaning in data. For instance, can we explain what happened by identifying customer segments and visualising their behaviour over time? These insights come in two forms: insights which explain, and insights which predict. Both help businesses compete effectively.
Business analytics tools provide rich interfaces for exploring data interactively.
What might happen?
Decision-makers look at what has happened in the past and the insights uncovered through analysis to inform plans for the future. They also establish processes to detect adverse events, allowing early intervention.
This is the land of dashboards, trends, thresholds, forecasts and modelling. The techniques vary but, at its core, learnings from the past are used to uncover insights about the current situation and make predictions about the future. This is also the richest area of growth in business analytics, as new possibilities are being uncovered all the time (and as new tools are released).
Business analytics and modelling tools help detect and leverage patterns in data which are difficult to uncover, visualise and interpret by hand.
Last week saw the 7th Google I/O developers’ conference. While it attracted less media attention than Apple’s WWDC, it was noteworthy in that it appeared to mark a step up in the seriousness of Google’s cloud offerings.
This might seem like an odd statement—Google after all is the poster child of massive-scale Internet companies—so it’s worth examining their history in the space, and in the context of commercial cloud providers in general.
Firstly, commercial cloud platforms have a range of different options. At the lowest level, you can spin up a virtual machine on request. You are then responsible for installing whatever services you require, keeping software up to date, hardening it for security, and so on. You can spin up more machines as required, and take them down when not required. The primary benefit is its familiarity: you can more or less replicate your physical infrastructure on someone else’s, but with a lot more flexibility with regards to volume. The down-side is that your IT management operations are also almost identical to what they are today; you just don’t have to worry about hardware any more. At the other end of the complexity spectrum is a hosted application environment, where you forget about servers and deploy your application directly to a “container”. Java and .NET environments are usually well supported here, and for common application types this is typically much more attractive.
Somewhere in the middle of this spectrum are hosted services. These include databases, queue services (for connecting different applications together at scale), DNS and caching, and so on. In other words, components used by your application that would often be a separate server in your existing infrastructure. These run the gamut from the familiar (for example, SQL databases) to more cloud-specialised components (such as some queues or specialised data stores), but the common factor is they can be deployed elastically as a service and you don’t need to worry about managing replication or any other aspects of a durable high-availability service.
It is typical to combine these. For example, your web application could be hosted in a managed container, with a hosted queue to distribute processing jobs amongst workers running on virtual machines, and using a managed database.1 This makes it much easier to write a scalable application, without the IT infrastructure complexity, but of course now you are more wedded to one provider as the APIs and specifics of individual services differ from provider to provider.
Amazon were responsible for launching the market, starting in 2006 with their queue service, soon followed by EC2 (virtual machines), and then an explosion of offerings. They are still the largest provider with the most options, and the first choice of start-ups and mature businesses alike. Microsoft made a late entrance with Azure, at first strongly tied to the Windows ecosystem but lately branching out (they now support Linux VMs, for example). Their progress has been marked by a more considered approach to releasing new services, in contrast to the AWS scatter-gun, guided by a focus on supporting specific business use-cases instead of simply introducing as many services as possible.
So, back to Google. Their entrance (relative to their stature as an internet company) was slightly belated in 2008 with the launch of App Engine. App Engine is an application-hosting environment, with infrastructure for creating a very scalable application, but it also involves committing entirely to Google’s approach: their data store, their APIs, etc (to a greater degree than other providers, with no companion virtual machines available for example). Perhaps because of this, uptake was a little lukewarm (note that Google started top-down with an application platform, while Amazon built bottom-up from virtual machines and other components, allowing users to gradually adopt at their own pace).
However, in the last year or so Google have:
To accompany all of that, at the conference last week they unveiled some impressive-looking development tools, and just as interesting from our perspective, complemented their Hadoop offerings with new technologies based on their internal tools for batch and stream processing. All of this is extremely promising for consumers, with the ability to pick the precise technology stack (and price-point) that best suits their needs, and three major players keeping prices low.
We will be examining these developments more closely in the weeks ahead. If you’d like to talk to us about the different providers, how a cloud strategy might fit in with your business (or not) or which provider might be best suited, feel free to drop us a line any time at email@example.com.
In fact, we have recently delivered a sizeable application using almost exactly this design. If NDAs permit, we might write an article about some of the lessons learned. ↩
A question we have been asked a few times, both with Condense and in our previous lives, is around letting advanced users have increased access to the data to do their own analyses. We are preparing a whitepaper on this topic, and we’ll drop a few blog posts with key points along the way. If you would be interested in reading this report, please drop us a line.
When you are implementing an application, you wind up exploring lots of trade-offs around how sophisticated the interface needs to be. Generally it is best to cater to the average user and the average use-case; sometimes the efficiency and user-friendliness of the application, or the budget, determines that advanced options have no place (although this is nearly always preferable to the interface that exposes every option, leaving it borderline-incomprehensible for even the simplest tasks).
We specialise in web applications that facilitate data exploration and analysis, so this trade-off is often more difficult than in many other domains… there are always further questions you can explore in your data!
Invariably there will be some users (“power users”) whose requirements can’t feasibly be implemented in the application itself, whether due to restrictions in technology, budget or UI design, or because those requirements weren’t anticipated. Sometimes these needs can be predicted and catered for in the design, so that users can interact with the application in their own ways; in other cases specific or unforeseen needs must be addressed after the application has already been rolled out.
We can cater to these users by giving them greater access to the underlying data, computational algorithms, etc. Exactly how this access is granted is fairly case-specific, and we’ll look at some possibilities in a future post.
A data analyst’s perspective
Let’s start by considering a simple motivating example: a web application for internal use to manage customer complaints. This is a fairly simple data model that records customer details including location, the category of complaint, and the product they have issues with.
This is a fairly simple “CRUD” application that can be easily and cheaply implemented using any number of modern web frameworks such as .NET MVC, Django, or Ruby on Rails. Basic operations such as listing all reports (with pagination), creating new reports and editing existing ones and so on can all typically be implemented with minimal code, and this basic application serves day-to-day needs well.
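As a rough sketch of what those frameworks generate for you, here is the complaint record and its basic CRUD operations in plain Python. The field names follow the data model described above, but the in-memory store is a stand-in: a real implementation would be backed by a database.

```python
from dataclasses import dataclass

@dataclass
class Complaint:
    id: int
    customer: str
    location: str
    category: str
    product: str

class ComplaintStore:
    """In-memory stand-in for the create/list/update/delete operations
    a web framework would provide over a database table."""

    def __init__(self):
        self._rows = {}
        self._next_id = 1

    def create(self, **fields):
        row = Complaint(id=self._next_id, **fields)
        self._rows[row.id] = row
        self._next_id += 1
        return row

    def list(self, page=1, per_page=20):
        # Listing with pagination, as mentioned above.
        rows = sorted(self._rows.values(), key=lambda r: r.id)
        start = (page - 1) * per_page
        return rows[start:start + per_page]

    def update(self, id, **fields):
        for k, v in fields.items():
            setattr(self._rows[id], k, v)

    def delete(self, id):
        del self._rows[id]
```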
Things start getting complicated when a strategic analyst wants to dive in and figure out where they should be spending their budget. Perhaps they first want to group customers by location, or get basic breakdowns of categories. All of these requests, once known, are by themselves probably easy enough to add to the application, although each takes time and costs money, and the application starts to get a little unwieldy.
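Those first requests are simple aggregations. A sketch over plain dictionaries (the field names are assumed to match the data model above):

```python
# Group complaints by any field -- e.g. location or category -- and
# count them, the kind of breakdown an analyst asks for first.
from collections import Counter

def breakdown(complaints, key):
    """Count complaints grouped by the given field."""
    return Counter(c[key] for c in complaints)
```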
It only gets more difficult though as the type of analysis increases in complexity. Let’s say that our analyst has gained access to a fancy new sentiment analysis tool from social media, and wishes to use this to see which of their customers have been creating the most unfavourable reports on Twitter and Facebook and might need to be addressed first—now our once-simple web application has to query multiple data sources or third-party APIs, possibly in a time-consuming manner not suited to a web application anyway.
A more mundane but common issue is that a typical analysis process is by definition exploratory in nature. Our analyst will not always just look at a single query and find the answer; they’ll more likely try one query, then try a variant to get a different angle, and depending on the results of those try completely different breakdowns… all of which would have to be built into our application.
To cap it all off, even if we did all this, it probably wouldn’t make life much better for our analyst, who almost certainly has their own favourite tool that they know inside and out.
For example, common data analysis tools include:
In other words, in all likelihood as soon as the analysis gets interesting it is probably best to provide the analyst with the data and let them get to work on their own terms. If necessary, interesting queries or metrics they unearth can then be implemented in the application itself.
A Data Manager’s Perspective
In addition to the consumer of this service, we also need to consult with the data manager, and anyone involved in its operation.
A range of questions need to be addressed, such as:
In future posts we’ll look at a few options for the questions raised here, other considerations, and potentially some case-studies.
If you would like a copy of the final report, based on this series, please let us know.
Do what now?
We love data visualisations. They can help you explore a large amount of data more intuitively, helping you spot trends and relationships quickly that you wouldn’t have noticed in the raw numbers, and they can tell a convincing story once you have the insights. If nothing else, they often simply look damn cool.
Recently we were exploring some historical weather observations data—stay tuned for more possible posts—and (partly for fun) decided to plot the stations themselves. A really handy visualisation for this purpose is a Voronoi Tessellation. It’s probably easiest to start with the result:
(The map is interactive, so you can use the mouse-wheel to zoom, and drag to pan around as normal.)
What’s going on here? A single cell in a Voronoi tessellation captures all the area around a point that is closest to that point and no others. If you look at any individual line, you will notice it is exactly half-way between two points: if you draw all those lines between all neighbouring points, you wind up with a Voronoi tessellation.1
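That definition translates directly into code: a location belongs to the cell of whichever point is nearest. A naive Python sketch of the cell-membership test (a production visualisation would instead use a library such as scipy.spatial or d3-voronoi to compute the cell boundaries):

```python
# Which Voronoi cell contains a given location? Simply the cell of
# the nearest site. Squared distance suffices for comparison.
def nearest_site(point, sites):
    """Index of the site whose Voronoi cell contains `point`."""
    px, py = point
    return min(range(len(sites)),
               key=lambda i: (sites[i][0] - px) ** 2 + (sites[i][1] - py) ** 2)
```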
So why would you actually do this? Well, as you can see it’s a nice way of getting an overview of spatial point data, particularly once you start adding extra data—for example, colouring the cells to denote properties such as maximum temperature for a region (the colours in our example are purely random). Another use is as part of a UI; sometimes you want people to click on a point, but small points are easy to miss… by (silently) using the area around the point as the target you make it much easier for people to click. One of the first uses historically was by a physician called John Snow in 1854 to derive his theory about the origins of a cholera outbreak, by outlining regions around water pumps.
Also, they’re just fun.
Well, it didn’t take long—our blog has already gone quiet, but there are two very good reasons. Firstly, we’re busy with client work, which is fantastic. Secondly, and perhaps more tellingly, our clients want to protect their competitive advantage, which makes it hard to talk directly about our work.
We expected that our data-oriented projects were going to be commercially sensitive—after all, an information gap can give business an edge over the competition.
Of course, success in business is about more than just knowing more than the opposition. It’s about establishing systems, processes, infrastructure, partnerships and a track record which influences buyer behaviour and delivers a better overall customer experience.
We think that effective business data-use underpins all of these factors.
“Three things that can get you fired from Caesars: stealing, sexual harassment and running an experiment without a control group.” Gary Loveman, CEO, Caesars Entertainment Corporation
NPR’s Planet Money make some great radio pieces on the use of data in business; “From Harvard Economist To Casino CEO” is an absolute classic. This next-generation CEO has a very data-oriented mantra: “three things that can get you fired from Caesars: stealing, sexual harassment and running an experiment without a control group.” It’s 20 minutes long and a real eye-opener.
From our perspective it’s all quite exciting. We have found that our positioning as software developers with data analytics specialisation is valued and our early clients are all helping to build our portfolio (stay tuned) and skills through relevant project work.
I finally investigated a groovy addition to Gmail today. Using this feature you can make your emails more convenient, interactive and useful.
In the example above you can see that my TripIt email has a “view itinerary” button in the listing. That’s called an action button, and Google has documented how it works:
By adding schema.org markup to the emails you send your users, you can make that information available across their Google experience, and make it easy for users to take quick action. Gmail, Google Search and Google Now all already use this structured data.
It’s worth skimming through the different types of actions available. They range from a one-click action, which doesn’t take the user out of their inbox, through to richer interactions such as a review action, which lets users provide feedback (again without leaving the inbox). You can also RSVP to an event or jump out to a webpage.
Condense is data focused and, as you might expect, these interactions are trackable.
Gmail supports 4 types of actions and 1 interactive card:
- RSVP Action for events
- Review Action for restaurants, movies, products and services
- One-click Action for just about anything that can be performed with a single click
- Go-to Action for more complex interactions
- Flight interactive cards
How it works
Essentially you add some additional attributes to the HTML source of your email (in this case to a link), which are picked up and rendered by the email client.
The related HTML in the TripIt email follows Google’s documented ViewAction format, with style attributes and other TripIt-specific details stripped out.
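A minimal sketch of such markup, following the schema.org ViewAction format from Google’s documentation. The target URL and labels here are placeholders, not TripIt’s actual values:

```html
<!-- Embedded in the email's HTML body; the email client reads the
     schema.org attributes and renders the action button. -->
<div itemscope itemtype="http://schema.org/EmailMessage">
  <div itemprop="potentialAction" itemscope
       itemtype="http://schema.org/ViewAction">
    <link itemprop="target" href="https://example.com/itinerary/12345"/>
    <meta itemprop="name" content="View Itinerary"/>
  </div>
  <meta itemprop="description" content="View your upcoming trip"/>
</div>
```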
This is essentially a nice-to-have feature: it improves the customer experience, but your email campaign will work just fine without it.
I haven’t investigated which other email clients support these features, but I don’t doubt that they’re out there.
It’s late January and we are putting final touches on our website. No doubt it will evolve over time but for now we’re quite proud of the structure.
Let me take you on a quick tour.
Right now you’re in the “thinking” section: this is where you’ll discover what we’re actively taking an interest in. The blog will include synthesis and opinion alongside other fresh content related to our industry. We straddle several spheres, so not all articles will appeal to everyone — we plan to evolve a few key subcategories to let people tune in to topics of interest.
Within the thinking section we’ve included active research interests. These are things which push the envelope for us and help us grow as a business. If you see something of interest there please get in touch.
Finally, we’ve added an “inspiration” page where we link out to articles and quotes from around the world which relate to the data-driven revolution. It is primarily focused on bubbling up interesting data-points, rather than on our commentary.
The other content is much as you would expect from a consulting company. Just enough information to make you want to call us!