This post is driven by some recent experiences and research. Statistics and analysis are important aspects of any business. Business decisions are made based on various reports and trend analytics. But are decision makers really provided with the relevant data?
There are some well-known and recognized challenges. A lot of articles try to itemize these challenges and present possible ways to mitigate them. Risk management is one of the areas relying most heavily on data analysis, but a lot of analysis is also done in Sales and Customer Service. Let us look at some of the most common issues.
Data, and in particular the amount of data and the slice of data selected for analysis, are important aspects that could drive widely different results.
Starting with the amount of data: collecting it is in most cases an overwhelming task. It relies heavily on input from various sources, but more on this later. It also relies, particularly in Sales and Customer Service, on input from users and customers.
From a Sales perspective, salespeople have historically retained personalized data about their book of business. This trend is not as widespread anymore. With the popularity of various CRM systems and the value they provide, more and more salespeople are leveraging these platforms to their advantage, and that value trumps the value of retaining information outside of a system. Yet there still are personal experiences that greatly influence sales results. The personal connections to specific clients and the experiences that built trust are items that, even if somehow captured in a sales system, will not translate equally to a new salesperson replacing an existing one.
From a customer support perspective, analysis results could easily be skewed by the well-known fact that a person will typically be more vocal when an issue arises than when they are satisfied with a product. This data, if analyzed incorrectly, could present a bleak image that does not realistically reflect the entire user or customer base.
As for the slice of data to be analyzed, the choice of batch could also produce wildly different results. Too small a batch will not be representative of trends. Too large a batch will be harder to analyze, and will require more effort, cost, and time. So then, what is the correct batch of data? Hard to say; it depends on what you are trying to analyze and the results you expect to see.
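The batch-size effect can be sketched with a small simulation. The numbers here are entirely made up: a hypothetical population of order values with a known mean, sampled in tiny and in large batches.

```python
import random
import statistics

# Hypothetical illustration: estimating average order value from batches
# of different sizes drawn from the same underlying population.
random.seed(42)
population = [random.gauss(100, 30) for _ in range(100_000)]  # true mean ~100

def batch_mean(data, n):
    """Mean of a random batch of size n."""
    return statistics.mean(random.sample(data, n))

small_batches = [batch_mean(population, 10) for _ in range(200)]
large_batches = [batch_mean(population, 5_000) for _ in range(200)]

# Small batches swing far more around the true mean than large ones,
# so a trend read from them is mostly noise.
print(round(statistics.stdev(small_batches), 1))
print(round(statistics.stdev(large_batches), 1))
```

The spread of the small-batch estimates is many times larger, which is exactly why an undersized slice of data can suggest a trend that is not there.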
And then, as far as expectations go, is the data somewhat in line with them? Whether it is or not, you always have to question the correctness of your results. A peer review process might be in order to validate your findings.
One more aspect: whether data is collected in real time or in batches could influence the results and the ability to analyze trends.
Data quality is an important aspect: the input will wildly influence the output. Problems could be the result of inaccurate input, sometimes driven by manual data entry. Other issues revolve around asymmetrical data from multiple systems, where one or more systems hold slightly out-of-sync or outdated data. This is sometimes mitigated by a unified system or tightly integrated systems. Again, real-time versus batch integrations can result in discrepancies.
Most data analytics projects commence from a requirement to observe trends, but start from specific premises. These assumptions are made based on specific prior experiences, and are highly susceptible to various biases. Assumptions are used to try to restrict a wide range of results, but they can hinder the ability to produce an accurate result. A simplistic example is trying to determine possible number combinations for a lottery. An assumption could be made that, when determining possible combinations, you want to eliminate consecutive numbers or sets of numbers that have already been selected. How correct some of these assumptions are is hard to determine, and additional analysis could be required to determine the probability of each scenario. This results in additional effort, cost, and time, which are sometimes in short supply.
Data from multiple sources
It is very common for organizations to store various data sets in disparate systems. These sources could be highly disjointed. Manually joining these data points could produce inaccurate results, and the outcome is greatly impacted by the analyst's ability to understand the sources and the correct relationships between them.
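A minimal sketch of the join problem, with made-up records: two systems refer to the same customer under slightly different keys, so a naive join silently drops data.

```python
# Hypothetical records: the CRM and the billing system disagree on how
# the same customer's name is spelled.
crm = {"ACME Corp": {"owner": "Dana"}, "Globex": {"owner": "Lee"}}
billing = {"Acme Corp.": 12_000, "Globex": 8_500}

joined, unmatched = {}, []
for name, revenue in billing.items():
    if name in crm:
        joined[name] = {**crm[name], "revenue": revenue}
    else:
        unmatched.append(name)  # lost unless keys are reconciled first

print(sorted(joined))  # ['Globex']
print(unmatched)       # ['Acme Corp.']
```

Any report built only from `joined` would understate revenue, and nothing in the output flags the loss unless the analyst thinks to track the unmatched rows.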
An example is reporting on the amount of donations collected by a not-for-profit organization. Analyzing the amounts directly can paint a declining picture, especially in the current times. A totally different picture could be presented when reporting these amounts against a declining membership base, where donations per member could still show an upward trend. Which is the real picture, you might ask? It depends, but how many additional data sources should drive a more realistic result?
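With entirely hypothetical figures, the two pictures look like this: total donations fall each year, yet donations per member rise once measured against the shrinking membership base.

```python
# Made-up figures: totals decline, but per-member giving increases.
donations = {2021: 500_000, 2022: 450_000, 2023: 420_000}
members = {2021: 10_000, 2022: 8_000, 2023: 6_500}

per_member = {year: donations[year] / members[year] for year in donations}

print(per_member[2021])            # 50.0
print(per_member[2022])            # 56.25
print(round(per_member[2023], 2))  # 64.62
```

Both series are arithmetically correct; which one is "the real picture" depends on the question the report is supposed to answer.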
Lack of specific data points
It is not uncommon to start an analysis process, only to find out that specific data points that would be relevant are not even collected in the first place. In addition, data collection for various data points could have started at different times, resulting in wild variations in trend analysis. Determining these inflection points, and presenting the data as impacted by them, might not have been estimated and in scope from the beginning. So then, what do you do? Do you ignore them, or ask for more time and budget to take them into account?
Visual representation of data
As for visual representation, we have seen many examples where visuals are tweaked for a more impactful presentation. Actions like stretching a graph, changing the proportions between axes, and generally varying the proportions of objects in a chart can present a much more dramatic picture.
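The axis trick is easy to quantify. A sketch with made-up numbers: a 5% quarter-over-quarter increase looks dramatic once the y-axis no longer starts at zero, because the rendered bar heights are measured from the axis start, not from zero.

```python
# Hypothetical values for two quarters of revenue.
q1, q2 = 100, 105

def bar_height_ratio(a, b, axis_start):
    """How much taller b's bar renders than a's for a given axis start."""
    return (b - axis_start) / (a - axis_start)

print(bar_height_ratio(q1, q2, 0))   # 1.05: bars look nearly equal
print(bar_height_ratio(q1, q2, 95))  # 2.0: Q2 looks twice as tall
```

Same data, same chart type; only the axis origin changed, and the visual impression doubled.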
Get your math straight
Math can also be a tricky thing to deal with, especially when used in analytics. Take, for example, the average. Unless the type is specifically mentioned, "the average" does not convey any precise meaning, and can easily be tweaked to make a point. For those who have not dealt with this before, there are three common measures of the average (I believe this is taught in grade six in North America, and easily forgotten later in life): the arithmetic mean, the median, and the mode. Depending on which one you select to present your data, your results will be widely different.
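Python's standard library computes all three directly. With a made-up set of salaries containing one outlier, the three "averages" land far apart:

```python
import statistics

# Hypothetical salaries (in thousands); a single outlier pulls the
# three measures of the average far apart.
salaries = [40, 40, 45, 50, 55, 60, 300]

print(round(statistics.mean(salaries), 1))  # 84.3: inflated by the outlier
print(statistics.median(salaries))          # 50: the middle value
print(statistics.mode(salaries))            # 40: the most common value
```

A report claiming "the average salary is 84" and one claiming "the average salary is 40" could both be technically defensible, which is exactly why the measure used should always be stated.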
The more we analyze what to present, how to present it, and how much information we need, the more we run the risk of hitting analysis paralysis. We will never be able to produce a report if we keep expanding the scope trying to take into consideration all possible inflection points. So, at some point, we have to narrow the scope and get a result.
How much should you trust a report? Hard to say. But as a consumer of reports, you should always take the information with a grain of salt. This is why decision makers are paid the big bucks. Oftentimes, they have to cut through the bullshit, ask the right questions, dissect and understand the data presented, and try to make the best decision with the available data.
For the report makers, data analysts and statisticians, it is our utmost responsibility to explicitly present the findings with as much context as possible, making sure that the consumers of our reports understand what is presented. Don’t just be another sensational media outlet putting out a chart with great visual impact and no context. Your consumers will appreciate honesty more than sensationalism.