Managing Stubborn, Elusive Data
The data is there, but you’re the PM for it now
I wrote here about how being a Product Manager for a data product is different from being a Product Manager for data.
One of the questions I encounter most frequently from PMs managing data is about frameworks and metrics we could use to efficiently track and manage all of our organizational data. In this post, I want to go over some processes that have worked for me, or for PMs I’ve worked with, when it comes to managing data.
Having Product Managers manage data is a relatively new development; the role was traditionally handled by many different people, from Program Managers to Technical Product Managers to even Data Engineers. It doesn’t matter what your title is: if one of your primary goals is to make the data generated from your product easily accessible and processed for analytics, these processes will work for you.
There are, however, slightly different approaches to taming crazy data depending on whether you are inheriting a data warehouse that was built by engineers, or whether you are ideating on data management based on needs your organization foresees.
I’ll tackle the first problem in this post, and the second problem in a separate post soon.
The data is there, but you’re the PM for it now
As hard as this seems, if this is you right now, you’ve actually got it easier than those who don’t have the data to start with. Having said that, it can be difficult to know where to start, and it’s easy to feel overwhelmed. Here are some of the things you can do to get started on your way to managing data.
1. Why do you need the data?
That’s the first thing you need to know: what is this data needed for? Is it to track KPIs? Is it for deeper analysis? And what happens when the data is wrong? How big is the impact?
Knowing this helps a PM determine which data is critical to the success of the organization and prioritize it by importance.
Outcome: Prioritization of different types of data, data attributes and data stores.
2. Who uses the data?
Answering the first question will lead you down this path, somewhat. Still, it’s important to dig into this explicitly. Obviously, if someone needs the data to calculate KPIs, they are using it. But are there other stakeholders who also need this data in some capacity?
Is the data used for reconciling revenue, or for building machine learning models that determine product strategy (a data product) elsewhere in the organization? Is a senior executive using it for financial forecasting?
Knowing the answer to this question leads you to your key stakeholders and makes you aware of all the downstream processes you are likely to impact any time you change the data.
Outcome: Stakeholders are identified, and the ways they use the data are documented. At this point, you know how often everyone uses this data, how much downtime is acceptable for your data warehouse, how fresh the data needs to be, and so on. You should also have an idea of which data is unnecessary, where the storage costs aren’t justified by the impact.
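As a concrete illustration, the freshness requirements you collect here can be turned into a simple automated check. This is a minimal sketch; the table names, SLA thresholds, and load timestamps are all hypothetical, and in practice the timestamps would come from your warehouse’s load-audit table:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-table freshness SLAs agreed with stakeholders, in hours.
freshness_slas = {
    "revenue_daily": 24,       # finance can tolerate day-old data at worst
    "clickstream_events": 1,   # ML features expect near-real-time loads
}

# Pretend these came from your warehouse's load-audit table.
last_loaded = {
    "revenue_daily": datetime.now(timezone.utc) - timedelta(hours=30),
    "clickstream_events": datetime.now(timezone.utc) - timedelta(minutes=20),
}

def stale_tables(slas, loads, now=None):
    """Return the tables whose last load is older than the agreed SLA."""
    now = now or datetime.now(timezone.utc)
    return [table for table, max_age_hours in slas.items()
            if now - loads[table] > timedelta(hours=max_age_hours)]

print(stale_tables(freshness_slas, last_loaded))  # ['revenue_daily']
```

Even a check this small turns a vague stakeholder conversation ("we need fresh data") into something you can monitor and report on.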
3. How does the data change?
Meet with the engineers, data scientists and architects who built the data warehouse. To be able to successfully manage data, you will need to know the source of input for the data, and how it transforms from the input all the way to the warehouse.
It’s important to ask when entries are made into the warehouse, what changes are made to them, what underlying assumptions are made when these transformations happen, and what ways exist to track and monitor transformations. In other words, if a data field is the sum of two other fields, where is this addition happening? Are there any conditions for the addition to take place? When does the addition not happen?
Outcome: Because for you, data is the product, this step familiarizes you with the product flow. You should be able to map the data to an architecture diagram at this point and know what data elements are introduced, deleted and changed at each new stage of the flow.
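One lightweight way to capture what you learn in these conversations is a lineage map: for each warehouse field, its source fields, the transformation applied, and where in the flow it happens. Here is an illustrative sketch; the field names, transforms, and stages are invented, not from any real system:

```python
# Illustrative lineage map built from conversations with engineers.
lineage = {
    "grand_total": {
        "sources": ["item_total", "shipping"],
        "transform": "sum",
        "stage": "nightly ETL job",
    },
    "customer_region": {
        "sources": ["billing_address"],
        "transform": "parse country code",
        "stage": "ingestion",
    },
}

def upstream_of(field):
    """Walk the map to find every raw field a warehouse field depends on."""
    entry = lineage.get(field)
    if entry is None:
        return {field}  # a raw source field depends only on itself
    deps = set()
    for source in entry["sources"]:
        deps |= upstream_of(source)
    return deps

print(upstream_of("grand_total"))  # {'item_total', 'shipping'}
```

Even a plain dictionary like this answers the questions above: where a field is computed, from what, and at which stage of the flow.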
4. Where is the data most vulnerable?
Now that you have the data transformations documented, you need to know where your product fails, and which failures are worse than others. This is a difficult nut to crack, because people who design a system rarely think about how it could fail.
You will need to challenge your engineers and architects to identify these failure points. Working backwards from the stakeholders will always help here. If our CFO got the wrong revenue numbers for March for the investor meeting in April, how bad would it be? Pretty bad. Well, ok. How could the revenue be wrong?
And if you work backwards to every point in your diagram where the revenue is touched in any way (where it’s the dependent variable, for the math nerds), you’ve got a risk. Factor in the business logic for the data transforms and you’ve just identified your big vulnerabilities.
Outcome: This is a good time to re-assess risk and add a column to your prioritization from Step 1: Risk. A data element or process that is high impact and high risk becomes your first priority.
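The prioritization itself can be as simple as scoring each data element on impact and risk and sorting by the product of the two. A minimal sketch, with made-up elements and scores on a 1–5 scale:

```python
# Hypothetical scoring of data elements on impact and risk (1-5 scales).
data_elements = [
    {"name": "monthly_revenue", "impact": 5, "risk": 5},
    {"name": "customer_churn_flag", "impact": 4, "risk": 5},
    {"name": "page_view_counts", "impact": 2, "risk": 2},
]

# High impact x high risk floats to the top of the backlog.
prioritized = sorted(data_elements,
                     key=lambda e: e["impact"] * e["risk"],
                     reverse=True)

for element in prioritized:
    print(element["name"], element["impact"] * element["risk"])
```

The exact scale matters less than having the conversation that produces the scores; the spreadsheet version of this works just as well.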
5. How is the data quality tracked?
This one is important. Because very few organizations know how to do it well, they simply don’t do it. If your data has been sitting with engineers for a while, without a business case, data quality has likely been forgotten.
Remember the assumptions and transformations from Step 3? At each transformation, what tests are being done to validate it? To go back to our example, is anyone checking that a field which is “supposed” to be the sum of two other fields is always computed as addition, and never, say, multiplication? Is a character field that is never supposed to be null ever null?
Do we know how often that happens? How much of your data is incomplete? Do we know why? If the answers to these questions are “I don’t know,” you need KPIs and metrics on data quality. If you find there is data needed by stakeholders that’s missing, or data that’s redundant, it needs to be tracked here as well.
Outcome: You should now have a document with details on test coverage, KPIs to track for data quality and metrics on how to measure quality. You should be able to map your document from Step 4 to this list and prioritize your most important Data Quality KPIs.
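To make the example above concrete, here is a sketch of two data quality KPIs using pandas: a null-rate check on a field that should never be null, and a consistency check on a field that is supposed to be the sum of two others. The column names and data are invented for illustration:

```python
import pandas as pd

# Illustrative warehouse extract; all column names and values are made up.
df = pd.DataFrame({
    "order_id":    ["A1", "A2", "A3", None],   # should never be null
    "item_total":  [10.0, 20.0, 5.0, 8.0],
    "shipping":    [2.0, 0.0, 1.5, 2.0],
    "grand_total": [12.0, 40.0, 6.5, 10.0],    # A2 looks multiplied, not summed
})

# KPI 1: null rate on a field that should never be null.
null_rate = df["order_id"].isna().mean()

# KPI 2: share of rows where the "sum" transformation actually holds.
consistent = df["grand_total"] == df["item_total"] + df["shipping"]
consistency_rate = consistent.mean()

print(f"order_id null rate: {null_rate:.0%}")              # 25%
print(f"grand_total consistency: {consistency_rate:.0%}")  # 75%
```

Checks like these, run on a schedule, are what turn the document from this step into metrics you can actually track quarter over quarter.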
6. Get the processes in place
This is going to be a work in progress. You will need to put a process in place for keeping your data healthy and managed in its most normal form (easy to access and granular, independently represented data). You need to work down your final priority list from Step 5 and put in place the testing for data quality, adding in missing data as needed and adding audits for the data.
You will need to run your plan and processes by your stakeholders, especially if you’re changing the way data is being accessed, stored and refreshed. If data accuracy is a challenge for a certain period of time, that communication should happen up front. Stakeholders need to know to what degree they can trust the data they’re using and what the caveats to using it are.
There needs to be a process to request changes to data, or new use cases, and a clear communication on what you will need as a PM to make a business case for prioritizing one data request over another.
Outcome: Best practices documentation, test cases, coverage analytics that are updated frequently, and a roadmap on how the data will be maintained and grown at a quarterly, half-yearly and yearly cadence. At this point, you know what your success metrics are, and if you have OKRs at your company, you can set those for yourself and your team.
Ultimately, every time you identify new uses for the data and new stakeholders, you will iterate through Steps 1–6 and build your roadmap from there. There will be other things you might have to think about later, such as managing data on-premises or in the cloud and how much to pay for each, handling broad infrastructural changes that could affect a wide cross-section of your stakeholders, and deprecating data (which is equally important).
Ideating on a new data management process and infrastructure is a whole other ball game, and that warrants another post, so stay tuned!