“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” - Abraham Lincoln
This quote, attributed to the popular American president, is perfectly suited to the world of data visualization. As any expert would tell you, it is futile to start a data visualization project without preparing the data first. Investing adequate time and resources in organizing data is what makes the resulting visualizations efficient and successful.
As businesses work with increasingly diverse, large, and complex data, adequate data preparation becomes a mandatory prerequisite. A Forrester survey found that nearly 66% of businesses have implemented a data preparation solution to manage the growing volume of data, while 56% have done so to help them work with multiple data sources.
What exactly is data preparation? Essentially, it is the process of collecting, cleaning, and consolidating data into one file or data table, primarily for use in data analytics.
Having helped several companies, including Fortune 500 companies, with their data visualization initiatives, we recognize that clean data is crucial for data quality and, ultimately, for the visualizations built on it. How can a business go about preparing its data for better data visualization? Here are a few essential steps:
Gather and Assess Data
The first step before embarking on any data-related process is to find the right data. Essentially, the quality of data gathering depends on how data professionals approach the following questions:
- Should I use the existing datasets or look at newer sources?
- What kind of data sources do I need for this visualization? For example, social media, business databases, customer transactions, or website data.
- What kind of collection methods or channels are available? For example, data collection tools include online forums, interviews, or online questionnaires.
For the most actionable insights, businesses need reliable data from a mix of sources. For example, to visualize a wage analysis, the data sources would include employee records, compensation data, and market value data.
Clean the Gathered Data
In a New York Times article, data scientists reveal that they spend 50-80% of their work time cleaning and organizing data. With unstructured data coming from multiple sources in different formats, data cleansing is critical to delivering useful data for the right insights.
If the gathered data is not cleaned, its quality stays low, and the final data model is far more likely to be unreliable, corrupted, or inaccurate.
Data cleaning is traditionally the most time-consuming part of data preparation, as it typically includes steps like:
- Removing duplicate entries
- Fixing structural errors
- Filtering unwanted data entries
- Handling missing data (for example, adding the missing values)
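The cleaning steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the sample rows, the column names (`customer`, `region`, `sales`), the "TEST" placeholder customer, and the choice to fill missing sales with a default value are all assumptions made for the example.

```python
# Hypothetical raw rows, as they might look after combining two sources.
raw_rows = [
    {"customer": "Acme ", "region": "north", "sales": "1200"},
    {"customer": "Acme", "region": "North", "sales": "1200"},  # duplicate once normalized
    {"customer": "Birch", "region": "South", "sales": ""},     # missing value
    {"customer": "TEST", "region": "South", "sales": "5"},     # unwanted test entry
    {"customer": "Cedar", "region": "East", "sales": "800"},
]


def clean_rows(rows, default_sales=0.0):
    """Apply the cleaning steps row by row and return a new list of rows."""
    cleaned = []
    seen = set()
    for row in rows:
        # Fix structural errors: stray whitespace and inconsistent capitalization.
        customer = row["customer"].strip()
        region = row["region"].strip().title()

        # Filter unwanted entries (here, an assumed "TEST" placeholder customer).
        if customer.upper() == "TEST":
            continue

        # Handle missing data: fall back to a default instead of failing.
        sales_text = row["sales"].strip()
        sales = float(sales_text) if sales_text else default_sales

        # Remove duplicates: keep only the first row for each normalized key.
        key = (customer, region, sales)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"customer": customer, "region": region, "sales": sales})
    return cleaned


cleaned_rows = clean_rows(raw_rows)
```

Note that the duplicate is only detected after the structural fixes, which is why normalization runs first. Whether a missing value should be filled with a default, estimated, or dropped depends on the business question.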
Here are a few effective guidelines and tips on how to clean raw data tables for creating great visualizations:
- Remove duplicate and irrelevant data entries: Duplicate entries often appear during data collection, when datasets from multiple sources are combined. Irrelevant entries are those that do not fit the specific business problem. Using such data for visualizations can result in inaccurate output, leading to wrong business decisions and increased costs. During data cleaning, also remove any repeated header rows so that the merged data forms one data table with a single header.
- Delete unused columns: The purpose of data visualization is to derive valuable insights from business data and find out what is useful (or not) for analysis. Unused data columns are often present simply because they were part of the original dataset. During data cleaning, retain only the columns that will be used in the visualizations or generated reports, and delete the rest.
- Clean any annotations in the columns’ values: Every value in a column must be of a single data type. For example, a numeric column should contain only numbers, and the “Date” field should contain only dates. The golden rule is to put any textual annotation in a separate description box rather than in the value itself, and to remove any annotation that has been added to a column value. For example, for year-wise population data, keep the figure for 2000 as a plain number and add the annotation separately as “Population figures for 2000 do not include XX city.”
- Combine rows or columns if the data is very sparse: Keeping sparse data spread over many rows or columns can increase the data size and adversely impact performance. As a good practice, try to combine the rows or columns and aggregate the datasets to reduce the overall file size.
- Fix irregular cardinality: Columns with very high or very low cardinality can hurt performance and are rarely useful for data visualization. For example, data columns with a cardinality of 1 (a single distinct value), or with zero or very low variance, can be dropped. Similarly, values with up to 10 decimal places can result in needlessly high cardinality and should be rounded to around 2 decimal places.
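Several of these guidelines can be illustrated together. The sketch below drops unused and single-value columns, moves annotations out of the values, and rounds long decimals. The table, the column names, the `*` annotation marker, and the helper functions are hypothetical, and the code assumes that every text column that is kept holds a number with an optional trailing annotation.

```python
import re

# Hypothetical table: yearly population with an annotation mixed into one value.
table = [
    {"year": 2000, "population": "1250000*", "growth": 1.2345678901, "country": "X", "internal_id": "a1"},
    {"year": 2001, "population": "1270000", "growth": 1.6000000012, "country": "X", "internal_id": "b2"},
    {"year": 2002, "population": "1300000", "growth": 2.3622047244, "country": "X", "internal_id": "c3"},
]


def constant_columns(rows):
    """Return the columns that hold a single distinct value (cardinality 1)."""
    return [c for c in rows[0] if len({row[c] for row in rows}) == 1]


def split_annotation(value):
    """Split a string such as '1250000*' into (1250000, '*')."""
    match = re.fullmatch(r"\s*(\d+)\s*(.*?)\s*", value)
    if match is None:
        raise ValueError(f"not a numeric value: {value!r}")
    return int(match.group(1)), match.group(2)


def tidy_table(rows, unused_columns, decimals=2):
    """Return (tidy_rows, notes); notes collect the annotations that were removed."""
    drop = set(unused_columns) | set(constant_columns(rows))
    tidy, notes = [], []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in drop:
                continue  # delete unused and single-value columns
            if isinstance(value, str):
                # Keep the number in the table; keep the annotation separately.
                value, annotation = split_annotation(value)
                if annotation:
                    notes.append((row["year"], column, annotation))
            elif isinstance(value, float):
                value = round(value, decimals)  # reduce needless cardinality
            new_row[column] = value
        tidy.append(new_row)
    return tidy, notes


tidy_rows, notes = tidy_table(table, unused_columns=["internal_id"])
```

Here `notes` plays the role of the separate description box: the chart works with clean numbers, while the removed annotations remain available for a caption or footnote.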
Transform the Data
Data transformation and enrichment is the process of adding and merging either first-party or third-party data to a dataset that an organization is already working with. Enriched data is an asset for data visualization as it provides more valuable insights.
Businesses that enrich their raw data can use it to make better-informed decisions. For example, businesses can enrich internal sales data with third-party advertisement data to get a better understanding of the effectiveness of their marketing campaigns and advertisements.
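As a rough illustration of the sales and advertisement example, the sketch below joins two small datasets on a shared key. The `campaign` key, the field names, the figures, and the derived `revenue_per_ad_unit` measure are assumptions made for the example and do not come from any real dataset.

```python
# Hypothetical internal sales data, keyed by campaign.
sales = [
    {"campaign": "spring", "revenue": 12000.0},
    {"campaign": "summer", "revenue": 18000.0},
    {"campaign": "autumn", "revenue": 7000.0},
]

# Hypothetical third-party advertisement data for the same campaigns.
ads = [
    {"campaign": "spring", "ad_spend": 3000.0},
    {"campaign": "summer", "ad_spend": 6000.0},
]


def enrich_sales(sales_rows, ad_rows):
    """Left-join ad spend onto sales rows and add revenue per unit of ad spend."""
    spend_by_campaign = {row["campaign"]: row["ad_spend"] for row in ad_rows}
    enriched = []
    for row in sales_rows:
        spend = spend_by_campaign.get(row["campaign"])  # None when no ad data exists
        ratio = row["revenue"] / spend if spend else None
        enriched.append({**row, "ad_spend": spend, "revenue_per_ad_unit": ratio})
    return enriched


enriched_sales = enrich_sales(sales, ads)
```

Every sales row is kept even when no advertisement data matches it, so a campaign without ad data shows up in the visualization as a gap rather than silently disappearing.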
Store the Enriched Data
The final step in data preparation is to unify the data from various sources and store it in an accessible location. For instance, data can be channeled into a third-party business intelligence (BI) tool.
With data forming the backbone of visualization, data preparation is essential when working with large volumes of raw data. Proper data preparation improves data quality for analysis by removing errors and normalizing the data.
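One simple way to keep prepared data in an accessible location is a small SQL database. The sketch below uses Python's built-in `sqlite3` module purely as an illustration; the table name `campaign_sales`, its columns, and the sample rows are hypothetical, and a real project would use whatever store its BI tool connects to.

```python
import sqlite3

# Hypothetical prepared rows: (campaign, revenue, ad_spend).
prepared = [
    ("spring", 12000.0, 3000.0),
    ("summer", 18000.0, 6000.0),
]


def store_rows(connection, rows):
    """Create the table if needed and insert or update one row per campaign."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS campaign_sales ("
        "campaign TEXT PRIMARY KEY, revenue REAL, ad_spend REAL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO campaign_sales VALUES (?, ?, ?)", rows
    )
    connection.commit()


# An in-memory database keeps the example self-contained; pass a file path
# to sqlite3.connect() to persist the data instead.
connection = sqlite3.connect(":memory:")
store_rows(connection, prepared)
stored = connection.execute(
    "SELECT campaign, revenue FROM campaign_sales ORDER BY campaign"
).fetchall()
```

Using the campaign as the primary key means that re-running the load updates existing rows instead of reintroducing the duplicates removed during cleaning.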
At Heptagon, we transform data and convert it into actionable insights that work for your business success. From data acquisition to cleansing and processing, we can help you make better business decisions with our analytics services. Contact us with your queries right away.