Data storytelling is an essential part of data science. It is the art of communicating complex data analysis through a compelling narrative, and it is critical for organisations that want to make informed decisions. By using data storytelling techniques, we can transform large data sets into meaningful insights for decision-makers. In this blog post, we will demonstrate how to tell a story using data visualisation, using a data set of Irish annual tax receipts from 1984 until today.

We’ll use Python to create four versions of a visualisation, each improving on the previous one by focusing more clearly on a compelling narrative and being more visually appealing.

Version 1

We start by visualising the data as a simple line chart using Python’s Matplotlib library.

Python
import matplotlib.pyplot as plt
import pandas as pd

# Load the data (download_dir is assumed to be defined earlier and to end with a path separator)
df = pd.read_csv(download_dir+"Open Data Tax Receipts Source.csv",
                 header=None, 
                 names=['year', 'month', 'type', 'category', 'amount'])

# Fix a typo
df['category'] = df['category'].replace('Valued Added Tax', 'Value Added Tax')

# Keep actual receipts only by dropping the estimated ('Profile') rows
df = df[df['type'] != 'Profile']

# Aggregate the tax receipts by year and category
df = df.groupby(["year", "category"])["amount"].sum().unstack(level=1).fillna(0)

# Plot the data as a line chart
fig, ax = plt.subplots()
df.plot.line(figsize=[14, 6], ax=ax)
plt.show()

As good data scientists, we have dutifully cleaned the data and created a chart that gives us a general idea of the trends of different categories of Irish tax receipts over time.
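
To make the aggregation step above concrete, here is a minimal sketch of the same `groupby` / `unstack` reshape on a toy table. The numbers and the `long_df` and `wide_df` names are made up for illustration and are not taken from the real data set.

```python
import pandas as pd

# Toy long-format data (made-up numbers, not the real receipts)
long_df = pd.DataFrame({
    "year": [1984, 1984, 1984, 1985, 1985],
    "category": ["Income Tax", "Income Tax", "Value Added Tax", "Income Tax", "Stamps"],
    "amount": [10, 5, 7, 20, 3],
})

# Sum within each (year, category) pair, then pivot the categories into columns.
# Combinations that never occur become NaN after unstack, so fill them with 0.
wide_df = (
    long_df.groupby(["year", "category"])["amount"]
    .sum()
    .unstack(level=1)
    .fillna(0)
)
print(wide_df)
```

The result has one row per year and one column per category, which is the shape that the pandas plotting methods expect when drawing one line (or one stacked bar segment) per category.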

This might be very useful to the data scientist who is familiar with the dataset and spends every day absorbing charts. But a stakeholder who is presented with this chart will probably struggle to understand what they’re looking at, and the chart’s lack of visual appeal makes it unlikely that they will be engaged enough to dig into it.

For example, the chart has no axis labels, titles, or any other descriptive elements. The reader doesn’t know what currency or units the values represent. The use of a line chart makes it very difficult for the reader to determine how total tax receipts are changing. Many categories of tax receipt also represent a very small share of the total; these may not be of interest to our stakeholder and simply add clutter to the chart.
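
One way to see which categories contribute only a small share is to compute each category's share of the total over the whole period. The sketch below uses made-up numbers, and the 10% cut-off is a hypothetical choice for illustration, not a rule taken from the data.

```python
import pandas as pd

# Toy wide-format data: one row per year, one column per category (made-up values)
toy = pd.DataFrame(
    {
        "Income Tax": [50, 60],
        "Value Added Tax": [30, 40],
        "Stamps": [5, 7],
        "Levies": [3, 5],
    },
    index=pd.Index([1984, 1985], name="year"),
)

# Share of all receipts contributed by each category over the whole period
totals = toy.sum()
shares = (totals / totals.sum()).sort_values()

# Hypothetical cut-off: treat anything under 10% as a candidate for grouping
threshold = 0.10
small = shares[shares < threshold].index.tolist()

print(shares.round(3))
print(small)
```

Categories that fall below the cut-off are natural candidates for merging into a single group, which keeps the chart focused on the largest contributors.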

Version 2

Our second version of the visualisation will be a stacked bar chart that is more accessible to our target audience. We add axis labels, titles, and other descriptive elements to make the chart easier to understand, and aggregate the smaller categories of tax receipts into a single new category “Others”.

Python
import matplotlib as mpl
import matplotlib.ticker  # import the submodule explicitly so that mpl.ticker is available

# Combine the smallest categories
smallest_categories = df.sum().sort_values().index.values[:6]
df["Others"] = df.loc[:, smallest_categories].sum(axis=1)
df = df.drop(smallest_categories, axis=1)

# Plot the data as a bar chart adding axis labels and titles
fig, ax = plt.subplots()
df.plot.bar(figsize=[14, 6], ax=ax, stacked=True)
# Label only every fifth year to avoid crowding the x-axis
labels = [l.get_text() if int(l.get_text()) % 5 == 0 else ''
          for l in ax.get_xticklabels()]
ax.set_xticklabels(labels)

# Format the y-axis tick labels (amounts are in thousands of euro, so x * 1e-6 is billions)
mkfunc = lambda x, pos: '%1.0fB' % (x * 1e-6)
mkformatter = mpl.ticker.FuncFormatter(mkfunc)
ax.yaxis.set_major_formatter(mkformatter)

plt.xticks(rotation=0)
plt.xlabel('Year')
plt.ylabel('Tax Receipts (EUR)')
plt.title('Irish Tax Receipts from 1984 until Today')
plt.legend(title="Category")
plt.grid(axis='y')
plt.show()

This chart is much easier to absorb. Our stakeholder can immediately see that tax receipts have been increasing over time and that the majority of tax receipts come from income tax and VAT. They can see that corporation tax is increasingly contributing to tax receipts, and that excise duties and other taxes are relatively small contributors.

However, the chart does not tell a story. There is no clear message or narrative, so different readers will notice different patterns in the data. As data scientists, we have a unique combination of skills that allows us not only to collect a dataset and visualise it, but also to identify the most important patterns in the data, understand their context, and judge which of them are statistically reliable, meaningful and relevant to our stakeholders.

By having a specific story in mind when creating a visualisation, we can help ensure that the data is being presented in a way that is both accurate and informative. It is our responsibility to communicate these findings to our stakeholders, rather than expecting them to notice idiosyncratic patterns in the data that may not be reliable. Ultimately, by presenting the data in a clear and compelling way, we can help our stakeholders make better decisions based on data-driven insights. One very simple way a data scientist can communicate their findings is to add annotations.

Version 3

Our third version of the visualisation will include annotations that actually communicate the key message we want to convey. We want to highlight significant changes in tax receipts, like those that happened during the global financial crisis in 2008 and the recovery period in the years following the crisis.

Python
# Bars sit at integer positions 0, 1, 2, ... so (year - 1984) gives each year's x position.
# Amounts are in thousands of euro, so y=80_000_000 places the labels near EUR 80B.
ax = fig.gca()
ax.axvspan(1995-1984-0.5, 2007-1984+0.4, alpha=0.2)
ax.annotate('Celtic Tiger', xy=(1995-1984, 80_000_000), va="top")
ax.axvspan(2008-1984-0.4, 2013-1984+0.5, alpha=0.2)
ax.annotate('Irish Banking\nCrisis', xy=(2008-1984, 80_000_000), va="top")
ax.axvspan(2020-1984-0.5, 2022-1984+0.5, alpha=0.2)
ax.annotate('COVID-19\npandemic', xy=(2020-1984, 80_000_000), va="top")
fig

The annotations provide more context to the plot. We can see that the Irish economy experienced a boom in the late 1990s, followed by a sharp decline during the global financial crisis in 2008. However, the Irish economy has since recovered, with tax receipts reaching pre-crisis levels in 2014.
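
A statement such as "receipts returned to pre-crisis levels in a given year" is worth checking against the numbers before it goes into an annotation. The sketch below shows one way to do that. The totals are made up (in the real analysis they could come from something like `df.sum(axis=1)`), and the crisis start year is an assumption passed in by hand.

```python
import pandas as pd

# Toy yearly totals (made-up values standing in for the real yearly sums)
totals = pd.Series(
    [40, 47, 41, 33, 32, 34, 37, 38, 41, 46, 48],
    index=pd.Index(range(2006, 2017), name="year"),
)

crisis_start = 2008  # assumed first year of the crisis period

# Highest yearly total before the crisis began (label-based slice, end inclusive)
pre_crisis_peak = totals.loc[:crisis_start - 1].max()

# First year from the crisis onwards in which totals match or exceed that peak
after = totals.loc[crisis_start:]
recovered = after[after >= pre_crisis_peak]
recovery_year = recovered.index[0] if not recovered.empty else None

print(pre_crisis_peak, recovery_year)
```

Running the same check on the real totals gives a quick way to confirm that the story told by the annotations matches the data.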

While the annotations do convey a simple story to our reader, the chart itself could be a little more fun to play with! Let’s enhance the plot further by using a more sophisticated library, which provides interactive and flexible visualisations.

Version 4

Our final version of the visualisation uses Plotly, an interactive visualisation library for Python.

Python
import plotly.express as px

# Define the figure
fig = px.bar((df.stack()*1000).reset_index(name='amount'),
             x='year', 
             y='amount', 
             color='category', 
             title='Irish Tax Receipts from 1984 until Today',
             barmode='stack')


# Add shaded spans
fig.add_shape(type='rect',
              x0=1995-0.45, x1=2007+0.45, y0=0, y1=85_000_000_000,
              fillcolor='grey', opacity=0.2, layer='below')
# Add annotations
fig.add_annotation(text='Celtic Tiger', x=1995-0.25, y=75_000_000_000, xanchor='left',
                   showarrow=False, font=dict(color='#373737', size=14))
fig.add_shape(type='rect',
              x0=2008-0.45, x1=2013+0.45, y0=0, y1=85_000_000_000,
              fillcolor='grey', opacity=0.2, layer='below')
fig.add_annotation(text='Irish Banking<br>Crisis', x=2008-0.25, y=75_000_000_000, xanchor='left',
                   showarrow=False, font=dict(color='#373737', size=14))
fig.add_shape(type='rect',
              x0=2020-0.45, x1=2022+0.45, y0=0, y1=85_000_000_000,
              fillcolor='grey', opacity=0.2, layer='below')
fig.add_annotation(text='COVID-19<br>pandemic', x=2020-0.25, y=75_000_000_000, xanchor='left',
                   showarrow=False, font=dict(color='#373737', size=14))

# Revise the chart labels
fig.update_xaxes(title='Year')
fig.update_yaxes(title='Tax Receipts (EUR)')
fig.update_layout(legend_title='Category')


# Format the tooltip
fig.update_traces(
    hovertemplate="<br>".join(
        # The d3 ".2s" format adds an SI prefix itself (G = billions, M = millions)
        ["Year = %{x}", "Category = %{fullData.name}", "Receipts = €%{value:.2s}", "<extra></extra>"]
    ),
)

# Hide the legend
fig.update_layout(showlegend=False)

# Show the figure
fig.show()

The interactive chart provides a more engaging way to explore the data. The reader can hover over the chart to see the exact tax receipts for each year. The chart also provides a zoom-in and zoom-out function, which allows them to focus on specific time periods.
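
The `.2s` format in the tooltip displays SI prefixes, so billions appear with a "G". If a label such as "€75.3B" is preferred, one option is to precompute the text in pandas. This is a minimal sketch: the `format_eur_billions` helper, the `label` column and the sample amounts are all hypothetical names and values introduced for illustration.

```python
import pandas as pd


def format_eur_billions(amount_eur: float) -> str:
    """Format an amount in euro as a short label in billions."""
    return f"€{amount_eur / 1e9:.1f}B"


# Toy long-format data in euro (made-up values, not the real receipts)
long_df = pd.DataFrame({
    "year": [2021, 2022],
    "category": ["Income Tax", "Income Tax"],
    "amount": [26_700_000_000, 30_700_000_000],
})

# Precompute the tooltip text so the chart only has to display it
long_df["label"] = long_df["amount"].map(format_eur_billions)
print(long_df)
```

The column can then be passed to `px.bar` through its `custom_data` argument and referenced in the hover template as `%{customdata[0]}`.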

Conclusion

Data storytelling is an essential part of data science. By using data visualisation techniques, we can convert large data sets into meaningful insights that can help us make better decisions. In this blog post, we demonstrated how to tell a story using a data set of Irish annual tax receipts from 1984 until today. We started by visualising the data in a basic way using Python’s Matplotlib library. We then added annotations and labels to improve the plot’s narrative. Finally, we used Plotly to create an interactive chart that provided a more engaging way to explore the data.

When telling a story with data, it’s important to keep the following principles in mind:

  1. Use clear and concise visualisations that are easy to understand.
  2. Identify the key message you want to convey.
  3. Provide context by adding annotations and labels.
  4. Use interactive visualisations to engage your audience.

By following these principles, you can create compelling data stories that can help you make better decisions.