Data integration is one of the most impactful initiatives you can take to benefit your organization. There’s a wealth of knowledge to be gained by having better visibility of the data in the systems you use every day. Harnessing your data by bringing it together into one location can provide insights that have the potential to improve everything your company does.
To achieve data integration, you have two options: Use a data integration platform, or write custom code. Whichever you choose, it’s critical to take time to think through the process before you start. In this article, we’ll walk you step-by-step through everything you need to consider.
Define Your End Goal
For any data integration project, it’s best to begin with the end in mind. What do you want to accomplish by bringing all of your data together? Many companies seek to achieve more in-depth reporting and analytics, enabling them to evaluate and improve their operations.
Let’s say, for example, that your ultimate goal is to use a business intelligence (“BI”) tool to develop a dashboard to help executives have better visibility of key metrics. Your first step is to determine which fields you’ll need for the charts, KPIs, tables etc. you want to create. This will show you how much data you need to push through your data pipeline. It doesn’t make sense to put in the time and effort to pull together all of your data right from the start if what you really need are a handful of fields to feed a few charts.
Instead, take a phased approach by limiting your first phase to only the data required to reach your immediate goal. You can always add more in the future. Better to achieve a few quick wins than to create a monster project that takes forever to complete. Phasing in gradually has the added benefit that if you decide after phase one that you want to switch from hand-coding to using a data integration platform, or move from one data integration platform to another, you’ve minimized your sunk costs.
Explore Your Options
After you’ve defined your goal, the next step is to figure out exactly how you’re going to get there. As mentioned above, you can either hand-code your solution or use a commercial data integration platform.
If you’re considering a data integration platform, allocate adequate time to research your options. Plan on several weeks at a minimum to complete this task. There are many platforms on the market, and depending on your needs, not all will be a good fit.
A good place to start is the research firm Gartner. You can pay to receive a detailed report from them, or you can just search for “data integration gartner magic quadrant”. This is a chart summarizing how Gartner ranks each of the data integration platforms. Starting with a resource like this will let you know who the players are and get you pointed in the right direction. Pick three or four platforms, go to the companies’ websites, and sign up for their demos. (Read about my experience going through the demo process.)
On the other hand, if you’re planning to hand-code your data integration solution, you’ll need to pick a coding language. As with data integration platforms, not all coding languages will give you what you need to complete your data integration project, so be sure to do your due diligence before you commit to a language.
Your primary concern at this point, whether you’re writing custom code or using a data integration platform, is to make sure that the solution you choose will be able to communicate with your data systems. Data systems fall into two categories: Those you’ll retrieve data from, called “sources”, and those you’ll send data to, called “targets” (or “destinations”). Since your goal, in most cases, is to bring your data together into one location, you’ll be dealing with multiple source systems and a single target system. We’ll delve deeper into both kinds of systems in the next two sections.
For now, be aware that some data integration platforms support almost any source and target, while others support only the most popular ones. The same applies to coding languages. Whatever solution you choose, it must fully support both your source systems and your target system. If a data integration platform or coding language you’re considering has no way to communicate with one of your systems, that’s a deal-breaker.
Identify Your Sources
Okay, it’s time to get a bit more technical. Start by listing out all of the sources from which you need to retrieve, or “ingest”, your data. Typical sources are databases, files, and APIs. There are many other types, but these are the most common by far.
To ingest data from a database, you’ll use something called a “connector”. This is a piece of software that enables communication with the database.
Data integration platforms have built-in connectors for most major databases. If your source database appears on the platform’s list of connectors, you should have no issues retrieving your data. If it’s not on the list, you may still be able to use a one-size-fits-all connector, called an “ODBC” connector, which functions as a generic way to talk to many different databases.
Similarly, a hand-coded solution will require the availability of a connector to communicate with a database. Most database vendors offer connectors (also called “drivers” or “clients”) for popular coding languages. As with data integration platforms, if there isn’t a connector available for your database in your coding language of choice, you can usually take advantage of ODBC as a fallback.
A file or API data source sometimes gets a little trickier. Ingesting data from a standalone Excel file, for example, should be a piece of cake. However, you may be dealing with a scenario where Excel files with unknown names are being dumped into a folder at unpredictable intervals throughout the day. In this case, you need a way to monitor the folder and process new files as they arrive. Whatever your workflow, if you’re going to use a data integration platform, you’ll want to make sure the platform can accommodate it before you make your purchase. This is why it’s important to go through vendor demos–to see that the vendor’s platform can handle what you’re going to throw at it.
APIs can present challenges too. If you’re using a data integration platform and a connector exists for the API you’re dealing with, you shouldn’t face much difficulty. But if you’re rolling your own code, things can get more complicated. For example, an API that returns data as XML, which is a hierarchical (nested) format, does not easily translate to a relational database that’s organized into rows and columns within a collection of tables. Be aware that developing a solution to handle this kind of source may require significant time and resources.
Identify Your Target
Just like with your sources, you need to make sure your data integration platform or coding language will be able to communicate with your target system. That means all the same considerations about connectors discussed above apply here as well. However, there are additional things to think about when it comes to your target.
One of the first things you’ll need to decide is what your data will look like in your target. Will you simply dump all your data as-is into your target and sort out data quality issues later? This kind of target is typically referred to as a “data lake”. Doing it this way follows the “ELT” model, which stands for extract, load, transform. That means you ingest data from your source and load it right into your target without manipulating, or “transforming”, your data in any way. One reason you might postpone the transformation step like this would be if the ultimate consumers of the data were data scientists or data analysts, who usually prefer to work with the data in its raw form.
On the other hand, if you’re combining data from different sources and pushing everything into a database, often called a “data warehouse”, this process follows the “ETL” model. Here, the transformation step happens before loading because your data must fit into your predefined database schema. “Schema” is a technical term that refers to how your data is organized.
So if your target is a data warehouse, you’ll need to decide what your schema will look like. Are you simply replicating data from a single source database? If so, your source and target will both have identical schemas. Most data integration projects, however, are not that simple. More likely, your data will be coming from multiple sources with different schemas, and you’ll need to design your target schema to accommodate all of the fields you’re ingesting.
While schema design is a topic that’s beyond the scope of this article, a simple rule to keep in mind is to make sure your schema supports your end goal. In other words, if your end goal is a new dashboard, design your schema in a way that will make creating the visualizations for that dashboard easy. You may need to do some research on database schemas. It’s worth taking the time to get this step right, even if it requires some trial and error. A bad schema design will only create headaches for you and others in the long run.
Understand Your Data
Once you’ve thought through your sources and target, the next step is to make sure you thoroughly understand the data you’re working with, especially if there will be any transformations happening in your data pipeline. Having a clear and complete understanding of your data is often called “data literacy”. This concept may sound elementary, but ignoring it will negatively affect the outcome of your data integration project.
Let’s take the topic of field names as an example–specifically, poorly chosen or undocumented field names. This issue creates as much confusion as almost any other issue in data integration. Legacy databases are common culprits because they tend to have short, cryptic field names, and fuller descriptions may or may not exist elsewhere. Even if the name of a field is clear, what data the field contains may not be (e.g., “cost” can include different things depending upon your organization’s accounting practices). This is where digging into the documentation becomes essential. In the absence of documentation, you’ll often need to reach out to the vendor (or developer) who created the field in the first place.
It’s worth taking the time to choose carefully what field names to use in your target. You don’t want to perpetuate naming issues from your source. Sloppy naming can lead to fields used incorrectly in reports, which results in misleading analytics, potentially steering your organization in the wrong direction.
In addition to precise naming, choosing your data types properly will ensure your data is correctly prepared for reporting and visualizations. Data can be numerical (e.g., price), character-based (e.g., name), boolean (e.g., true/false) or even binary (e.g., an image). Within each of these data type categories, there are further distinctions (e.g., decimal vs. whole number). If you are combining data from several sources into a single target, rarely will fields from different sources that contain the same data have identical data types. Define the field in your target the way that most accurately captures your data and will best support your reporting and analytics goals.
Outline Data Mapping
After you’ve identified your sources, defined your target, and have a clear understanding of the data you’re working with, it’s time to draw up how each piece of data will move from source to target. If you’re using a data integration platform, you’ll be able to do this right in the software. If you’re hand-coding, it can be helpful to put together a spreadsheet with the source fields in one column and the target fields in another column.
For moving data into a data warehouse, this is where you begin to identify what data from your sources must be transformed. It may need to be split into separate target fields, combined into a single target field, deduplicated (eliminate redundant data), cleaned (e.g., remove special characters in phone numbers) converted (e.g., units of measure, like pounds to kilograms), or any number of other transformations to make sure the data that ends up in your data warehouse is of the highest quality.
Begin Building Your Data Pipeline
At this point, we’ve walked through everything you need to set up your data integration. Congratulations! You’re ready to begin building out your data pipeline.
If you’re planning to use a data integration platform and you’ve gone through the demos, pick the one you think best meets your needs and start with a trial version (most vendors offer trials). Until you actually work with a platform in your particular environment, you won’t know whether all the features promised by the vendor actually work as advertised with your specific set of systems.
If you’re hand-coding, start small. Begin with a few fields from a database source or file source, and get the integration working with those few fields first. For an API source, choose the simplest endpoint. Once you get that working, you can move on to adding more fields or interacting with more complex endpoints.
Caveat: If you’re hand-coding, be sure to take a careful look at all of the data in your sources, even though you won’t be dealing with all of it in phase one. Getting an overall picture of your data will help you avoid building an inflexible framework that you’ll end up having to overhaul during the next phase to accommodate something you didn’t think about.
Building a data pipeline involves chaining several technologies together, and as we all know, a chain is only as strong as its weakest link. Making the effort to follow everything outlined here will enable you to build a robust data pipeline and ensure a successful data integration project for your organization.