Visualizing Data: From Exploration to Crafting Stories

Introduction

Data visualization is the science and art of encoding information using clear, visually appealing graphics. Visualizations come in various forms, including:

static infographics
technically complex dashboards
data art

Depending on the form of visualization used, the discipline requires a broad range of knowledge and skills, such as graphic design, UI/UX design, programming, and analysis. This article focuses on some of the most important phases of building data visualizations in the context of data science and analytics applications.

Data visualization is an indispensable tool for gaining meaningful insights from large data sets. Even smaller businesses generate large quantities of transactional and operational data daily, and the insights hidden within this data constitute a veritable gold mine. Since these large datasets are collections of figures and tables that are intuitively meaningless to most people, identifying trends and patterns in the raw data can be challenging and time consuming, if not torturous. When information is encoded into clear, visually appealing graphical formats, both analysts and non-technical users can more easily identify insights from complex data.

Data visualization can be an invaluable addition to any area where large sets of data are present, such as engineering, science, and business. In business, visualization methods can be applied to any area involving analytics including customer trends, operational metrics, and accounting. Also, visualizations provide a means of communicating information related to specific domains to those who may not have much background knowledge of the subject area, which makes for more effective presentations.

In the following sections, we will explore four basic stages of building and developing effective visualizations, culminating with a brief overview of translating data into stories.

Stage 1: Data Wrangling & EDA

The first phase of developing a visualization has two basic steps: Data Wrangling and Exploratory Data Analysis (EDA). Data wrangling is where the data set is initially studied in order to obtain a high level overview of what data the set actually contains, and to formulate ideas about what insights the data might contain. Data wrangling also entails transforming and cleaning the data values, or assessing what other data should be merged into the set. After cleaning and transforming the data, the next step is EDA, which is the most important aspect of the entire process.

Data Wrangling

The data wrangling step can be broken down into three basic parts: Discovery, Transformation, and Validation. Below we’ll look at each part in more detail, and will use Jupyter Notebook, Python, and Pandas as the basic set of tools for providing a practical point of reference.

Discovery – Discovery involves analyzing the data in its raw state in order to establish a firmer grasp of what data exists in the set, what data type each column holds, and whether more data will need to be added in order to satisfy business requirements. In this stage, the analyst will, using the tools specified above, load the dataset into a Pandas dataframe, take note of the data type specified for each column, and possibly run various operations, like getting all unique values for columns featuring categorical data.
Transformation – This part of the process is vital, and involves structuring, normalizing, denormalizing, and cleaning the data. The primary goal of this process is to smooth out and rectify potential errors in the data, and to structure the data so that it will be much clearer and easier to process and understand during the exploratory and later phases. Such errors might include missing values, extreme outliers, bad data values, and formatting issues like unnecessarily lengthy floating-point decimals. The Pandas library features many built-in methods for performing essentially all of these functions.
Validation – Following the process of transformation, the analyst or team will need to verify all of the data to ensure steps taken during the transformation process did not distort the data in a way that could yield inaccurate or misleading insights.
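
The three parts above can be sketched in Pandas roughly as follows. This is a minimal illustration rather than a prescribed workflow: the table, its column names (region, amount, order_date), and its values are all invented for the example.

```python
import pandas as pd

# Hypothetical raw data: column names and values are invented for illustration.
raw = pd.DataFrame({
    "region": ["North", "south", "North", None, "East"],
    "amount": [120.456789, 80.0, None, 45.5, 10000.0],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06",
                   "2024-01-07", "2024-01-08"],
})

# Discovery: inspect column types and the unique values of a categorical column.
print(raw.dtypes)
print(raw["region"].unique())

# Transformation: fix inconsistent casing, parse dates, trim long decimals,
# then drop the rows that are still missing a region or an amount.
clean = raw.copy()
clean["region"] = clean["region"].str.title()
clean["order_date"] = pd.to_datetime(clean["order_date"])
clean["amount"] = clean["amount"].round(2)
clean = clean.dropna(subset=["region", "amount"])

# Validation: confirm the cleaning did what was intended and nothing more.
assert clean["region"].isna().sum() == 0
assert clean["amount"].isna().sum() == 0
assert set(clean["region"]) <= {"North", "South", "East"}
print(len(raw), "rows before cleaning,", len(clean), "rows after")
```

Dropping rows is only one way to handle missing values; whether to drop, fill, or flag them depends on the business requirements identified during discovery.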

Exploratory Data Analysis

Once the data has been properly cleaned, normalized, structured, and verified, the dataset can be further explored in a tool like Jupyter Notebook or imported into another tool like Tableau. EDA is ultimately an iterative process, and will be utilized from this point until the last phase. The very goal of this entire process – the resulting visualization or dashboard – may itself be an EDA tool. If compiling visualizations for a static presentation or infographic, then this process is what will ultimately yield the desired assets.

This step will intermingle somewhat with the next phase of chart selection, since the analyst will likely be running the data through different displays and default charting features. In the initial phase, the goal is to simply explore the data using default, out of the box charts and methods, and only during later phases will actual chart selection, design decisions, and story development take precedence.
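
A first exploratory pass of this kind might look like the sketch below. It assumes matplotlib is installed (Pandas uses it for DataFrame.plot), and the sales table is hypothetical data made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.DataFrame({
    "month": pd.date_range("2024-01-01", periods=6, freq="MS"),
    "units": [10, 12, 9, 15, 18, 21],
    "revenue": [100.0, 125.0, 88.0, 160.0, 185.0, 230.0],
})

# Summary statistics give a quick numeric overview of each column.
summary = sales[["units", "revenue"]].describe()
print(summary)

# A pairwise correlation hints at relationships worth charting.
corr = sales["units"].corr(sales["revenue"])
print(round(corr, 3))

# Default, out-of-the-box chart: no styling decisions are made at this point.
ax = sales.plot(x="month", y="revenue")
```

In a notebook the chart renders inline; the backend line at the top is only there so the same code runs as a plain script.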

Stage 2: Choosing the Right Visualization

Different Chart Types

The specific types of charts used to display information are determined by the data types being expressed and the insights the user is intended to gain. If the intent of the chart is to display trends over time, then a bar chart may not be the best choice; a line chart would be advised, since it conveys the desired insights more clearly. Likewise, a line chart is a poor fit for categorical data, where a bar chart allows the viewer to make intuitive comparisons more easily.

Below are a few chart types and their associated data types:

Bar Charts – Generally express a single variable across multiple categories. The most ubiquitous chart type, it uses primarily length or height to convey differences.
Line Charts – Show a single variable across a time series. If there are multiple categories, then displaying multiple, differently colored lines would be an effective strategy for conveying information.
Pie Charts – Used primarily to display the ratio between various categories in terms of percentages. These effectively use the visual characteristics of area and color to convey information.
Node-Graph (Network) Charts – Used to display relationships (edges) between different entities (nodes) within a dataset. Color, size (radius for nodes and thickness for edges), and proximity are all visual characteristics that can be used to enhance the richness of this visualization.
Choropleth Maps – Used to display geographical data. A variety of visual characteristics can be used to convey information here: color to display diverging values (hot vs. cold) and categorical data, nodes and radius to mark cities and convey population size, and time to convey geographical changes over time.
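
The contrast between the first two chart types in this list can be sketched with matplotlib as follows. The data is hypothetical, and matplotlib is an assumed choice of plotting library; any charting tool would serve the same purpose.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical values for illustration only.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
monthly_revenue = [100, 125, 88, 160, 185, 230]
categories = ["Hardware", "Software", "Services"]
category_revenue = [420, 310, 158]

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Trend over time: a line emphasizes direction and rate of change.
ax_line.plot(months, monthly_revenue, marker="o")
ax_line.set_title("Revenue over time")
ax_line.set_ylabel("Revenue (USD thousands)")

# Comparison across categories: bar length makes differences easy to compare.
ax_bar.bar(categories, category_revenue)
ax_bar.set_title("Revenue by category")
ax_bar.set_ylabel("Revenue (USD thousands)")

fig.tight_layout()
```

Calling plt.show() or fig.savefig(...) afterwards displays or exports the figure.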

Understanding your audience

As with any form of presentation, understanding your audience and the context you'll be presenting in is a key component of successful chart selection. If you are creating a more immersive visualization experience for users of a web application, then more complex, interactive visualizations are extremely useful. If the visualization is intended to augment a spoken presentation to a large crowd, then simpler chart types like pie, line, and bar charts are most desirable, since the audience will be able to quickly grasp the intended meaning of the visualization.

Stage 3: Design Decisions

Now that the appropriate chart type(s) have been selected, the next step is to make design decisions that will accentuate the patterns and relationships present in the data. This section will first explore general design principles relevant to visualization, then the importance of prudence in the application of color, the value of reducing chart junk for enhancing clarity, and finally a few examples of common visualization pitfalls. It is important to note that the principles below are interwoven in their application rather than strictly distinct, which underscores the fact that data visualization integrates aspects of both science and art.

General Design Principles

The primary goal of visualization design is to optimize the amount of information that the user can process in the shortest possible time. While aesthetically pleasing and interesting visuals can be desirable, they are not the primary goal and too much artistic embellishment can actually distract and fatigue the user. Simple, clear, and effective encoding should be the priority considerations when choosing design principles, with a focus on intuitive user interaction design in the case of more interactive visualizations and dashboards.

Clarity

Only the data needed for the visualization should be encoded for display, especially in top-level views, and only the elements that are essential to visualizing the data, or subtly enhancing the primary elements should be incorporated. Following this principle will enhance the clarity of the visualization by reducing visual noise, which is also often referred to as chart junk. The presence of visual noise proportionally increases the cognitive load experienced by the user, demanding more effort from the user to extract the desired insights while inhibiting the user’s ability to filter potentially false interpretations. This will also have the effect of making the visualization unpleasant to use, especially in cases such as dashboards where regular situational awareness is the primary use case of the visualization(s).

Filtering and Drill-down Capabilities

Related to the above design principle, filtering and the ability to drill down into the data help to keep visualizations from overloading the user. Enabling filtering gives the user the versatility to view and compare only the areas of specific interest. Having a clear bird's-eye view of the data with drill-down capabilities will allow the user to zero in on areas of interest, and explore the data characteristics in greater detail without any sort of overload in the initial view.
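
At the data level, the overview-then-detail pattern can be sketched in Pandas as shown here. The orders table and the drill_down helper are hypothetical and exist only for this example; a dashboard tool would normally wire the same logic to a click or a filter control.

```python
import pandas as pd

# Hypothetical data for illustration only.
orders = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "city": ["Oslo", "Bergen", "Rome", "Rome", "Naples"],
    "amount": [120.0, 80.0, 200.0, 50.0, 75.0],
})

# Top-level view: one aggregated number per region.
overview = orders.groupby("region")["amount"].sum()


def drill_down(df, region):
    """Filter to one region, then break that region down by city."""
    subset = df[df["region"] == region]
    return subset.groupby("city")["amount"].sum()


# Detail view, computed only when the user asks for it.
south_detail = drill_down(orders, "South")
print(overview)
print(south_detail)
```

Keeping the initial view at the aggregated level is what prevents overload; the detail is computed only for the region the user selects.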

Aesthetics and Beauty

While artistic embellishments should never be prioritized over encoding information with clarity, creating aesthetically appealing visualizations can increase user engagement. Clarity in visualization helps to reduce the cognitive load of the user, while an aesthetically pleasing interface can incentivize the user to look at the visualization for longer periods of time.

Basic UI Best Practices

For visualizations that are intended to be interactive, creating an intuitive interface for users will reduce cognitive load. The cognitive noise introduced by an unintuitive interface has been referred to as interface noise, and is often characterized by poor layouts, odd or unexpected interactions, and difficulty navigating to the desired views or information, to name a few examples.

Color

While color could fall under general design principles, in data visualization it deserves special attention since its use and misuse can make the difference between a truthful, impactful visualization and a misleading, noisy one. Color is often used to highlight important or interesting data, and one strategy for leveraging color is to use softer, more neutral colors for normal data and chart elements like axes and annotations, while reserving bolder colors for highlighting values of interest to the user. Too much uniformity in color properties risks either overloading the user, in the case of intense colors, or missing the opportunity to quickly highlight important information that might be of interest to the user.

Color choices are also a major concern in terms of accessibility. Many users have some degree of color blindness, so color variations should be chosen based not only on hue but also on contrast. High-contrast colors should be used to highlight important insights in the visual, since the effects of this strategy are noticeable to users both with and without color blindness.
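
The strategy of neutral colors with a single bold highlight can be sketched in matplotlib as follows. The data is hypothetical, and the specific hex colors are arbitrary choices for the example; they have not been checked against any accessibility guideline.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical values for illustration only.
categories = ["A", "B", "C", "D", "E"]
values = [23, 45, 31, 78, 29]

# Neutral gray for ordinary bars, one bold color for the value of interest.
NEUTRAL, HIGHLIGHT = "#b0b0b0", "#d62728"
highlight_index = values.index(max(values))
colors = [HIGHLIGHT if i == highlight_index else NEUTRAL
          for i in range(len(values))]

fig, ax = plt.subplots()
bars = ax.bar(categories, values, color=colors)

# Soften non-data elements so they do not compete with the data.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
ax.tick_params(colors="#555555")
```

Because the highlighted bar also differs from the others in lightness, not only in hue, the emphasis does not rest on hue perception alone.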

Text

Labels, annotations, and tooltips are all useful means of aiding the user's understanding of the features and context of the data. Text in visualizations should follow the same essential guidelines as good UI design, such as using hierarchy in font size and font weight to help lead the eye of the user to the most important data first, and using text color to highlight important terms. Also, the font family should be very easy to read in the context in which the visualization will be presented, again with the purpose of reducing cognitive load and strain for the users.

Annotations, labels, and tooltips are all vital uses of text to provide context for visualizations. These elements should be positioned as close to the elements they describe as is possible without cluttering the visual. Repeating units of measurement liberally throughout the visual can also reinforce context for the user, thereby reducing cognitive load. Tooltips are a great approach for defining the characteristics of different data elements without cluttering the presentation, but are generally only applicable in the context of an exploratory and interactive visualization.
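
A minimal matplotlib sketch of these text guidelines is shown below, using hypothetical data. The font sizes and the offset of the annotation are illustrative choices, not recommended values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical values for illustration only.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [100, 125, 88, 160, 185, 230]
x = list(range(len(months)))

fig, ax = plt.subplots()
ax.plot(x, revenue, color="#555555")
ax.set_xticks(x)
ax.set_xticklabels(months)

# Hierarchy: a larger, bolder title; a smaller axis label that states the unit.
ax.set_title("Monthly revenue", fontsize=14, fontweight="bold")
ax.set_ylabel("Revenue (USD thousands)", fontsize=10)

# Annotate the point of interest right next to the data it describes,
# repeating the unit so the number can be read on its own.
low = revenue.index(min(revenue))
note = ax.annotate(
    f"Lowest month: {revenue[low]}k USD",
    xy=(x[low], revenue[low]),
    xytext=(x[low] + 0.3, revenue[low] - 15),
    arrowprops={"arrowstyle": "->"},
    fontsize=9,
)
```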

Stage 4: Putting It All Together – Storytelling

The previous phases are adequate for the development of basic exploratory visualizations and dashboard development, but what if the end goal is to communicate actionable insights to users who may be unfamiliar with the technical aspects of the data or business generally? This scenario is where incorporating the above data and the key insights gained from exploratory analysis into stories becomes essential. People naturally learn most effectively through narrative, so composing the above data visualizations and their insights into an appealing, easy-to-consume format is the most intuitive way to persuade the intended audience.

Below is an overview of a few of the principles which are essential to developing narratives using data and the insights gleaned from exploratory analysis. Since this section incorporates many of the principles discussed in the previous sections such as clarity, knowing your audience, and choosing the correct visualizations, only principles exclusively relevant to this phase will be discussed.

Express a Clear Message

The first step in creating a data story should be to clearly define the key insights and takeaways that need to be presented, and it's ideal to focus on one central concept or insight when developing the story. A focused presentation helps to create a more coherent and persuasive argument without overwhelming the audience, so all data, visualizations, and text should be limited to elements that reinforce the central theme.

Similar to the design principle of clarity discussed above, having a consistent visual theme and reducing visual clutter is essential to conveying a clear message. Using clear and simple language without overusing jargon is also key to maintaining clarity. Where specific technical terms are used, there should be an accompanying definition and explanation of the term unless the primary audience is also technically proficient in the subject discussed.

Establish Context

Presenting a collection of data points without also framing the data in context will ultimately provide little value to the intended audience. Defining a central theme and set of insights, as discussed above, is one of the primary steps for establishing context. There must be a “purpose” to the presentation in order for it to convey any real sense of meaning to the audience.

Another key aspect of establishing context is to consider what background information may not be apparent to the audience, and to include that information in the presentation. Such information may include the historical context of the information, assumptions that may underpin the initial exploration of the data or that may be implicitly held by the audience, and an explanation of the features of the dataset itself. Incorporating this background information is essential to providing a “why” to the presentation.

Presenting the data relative to a benchmark or other type of comparative metric is an important strategy for establishing context. Comparisons against industry averages and KPIs are common means of providing context to data, and are often among the more effective ways of giving the audience a relative frame of reference.
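
As a small illustration of framing values against a benchmark, the sketch below expresses each store's conversion rate as a percentage difference from an industry average. Both the store figures and the benchmark value are hypothetical, invented for the example.

```python
import pandas as pd

# Hypothetical figures; the benchmark is an assumed industry average.
stores = pd.DataFrame({
    "store": ["A", "B", "C"],
    "conversion_rate": [0.031, 0.024, 0.040],
})
INDUSTRY_AVERAGE = 0.030

# Percentage difference from the benchmark, rounded for presentation.
stores["vs_benchmark_pct"] = (
    (stores["conversion_rate"] - INDUSTRY_AVERAGE) / INDUSTRY_AVERAGE * 100
).round(1)

# A simple flag that a chart can map to a highlight color.
stores["above_benchmark"] = stores["conversion_rate"] > INDUSTRY_AVERAGE
print(stores)
```

A figure such as "20% below the industry average" tends to carry more meaning for an audience than the raw rate alone.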

Develop a Narrative Arc

Establishing context effectively fulfills the first part of creating a narrative arc. Once the context has been defined, the narrative can lead the audience from more generalized observations to specific insights, and then finally to the key insight. Once the key insight, which is the conclusion of the presentation, has been revealed, a call to action should be presented.

A few strategies for improving story engagement and effectiveness are:

Showing is better than telling – A picture is worth a thousand words, and this adage is just as true, if not more so, for visualizations as it is for fine art. Clear, intuitive visualizations should be used to convey the insights, and in this context more novel visualization types should be used with caution. To borrow another familiar observation, it's often said that a joke that has to be explained is a bad joke, and so it is with choosing visualizations.

Infuse with emotional appeal – Whenever possible, insights should be framed in a way that is relatable to the target audience, aligning the message with their likely desires and concerns. Stories that have emotional impact are generally more engaging and memorable.

Make use of metaphor and analogy – The use of analogy and metaphor are time-tested devices for framing more abstract concepts in more relatable terms. Using these devices will help users to begin making mental connections, reinforcing the presentation message and enhancing retention of the insights.

Build suspense and anticipation – The story should be structured so that the audience anticipates the final conclusion. Each step of the middle section should serve to build up to the final conclusion, such that the audience almost has the experience of coming to the conclusion themselves.

Conclusion

Data visualization is an indispensable technique in the process of transforming raw data into actionable insights and making impactful presentations. This article has presented a generalized overview of the processes for this transformation, from wrangling and exploration to storytelling, with a focus on how visualization plays an important role at each step. Exploratory analysis can be likened to a repeated process of distillation, where visualizations, insights, and objectives are refined as new patterns are recognized with each recurring exploration.

A few general themes are recurring throughout this process: knowing the intended audience, clarity, and context. Anticipating the intended audience and the context in which the visualization will be used are key to making important decisions about visualization design. Context includes both the environmental context (whether the visualizations are part of a dashboard or presentation) and the narrative context. Narrative context is best established through presenting a clear, focused message with well-defined problems, while environmental context informs design decisions. Clarity in design and message is also important in terms of reducing cognitive load in the audience, which in turn supports more engaging and impactful designs.