Cracking the Data BA Code — A Cheatsheet
It feels good to be back to writing. It seems like forever, really, and so much has happened: I had a kid, who is all of nine months now, and released my first album. Whew!
In that time, I played the role of a data product manager and realised that there is some general confusion around, and a newness to, the concept of a data BA. Hence, this blog.
The arena of business analysis has seen the rise of a new species of analyst: the slightly impressive-sounding and increasingly relevant data BA. Not data analysts, forecasters or scientists, but data BAs.
What does a data BA do, you ask?
Or maybe you are one, or you know one, and you want to find out what it takes to be good at it.
This is not going to be (I hope) a preachy sermon. I don’t have enough years behind me doing this to be preachy (not that I would preach otherwise anyway; I mean, who likes being preached to? :) )
But having been a data BA / data product manager for a year, I understand this space well enough to know what good looks like. I am somewhere along this journey and would like to share what I have learnt so far.
So, what does a data BA do?
I think a data BA supports a data initiative in the organisation from the perspective of processes, value identification and articulation, and release strategy.
I will talk about the process aspects today, and how they play a role in determining the success of a data BA.
Before I navigate any further, I must underscore the fact that I have a terrific data dev team at Thoughtworks to work with. This has helped me focus on the core analysis part of the role, without needing to get too technical unless the team needed it. That has several advantages, chief among them the freedom to do core data business analysis.
Alright. So, let’s get down to the “process” around data projects and where a BA’s skills intersect with it.
A data project can involve some or all of the following:
- data ingestion
- data transformation
- data quality management
- maintaining a lineage or snapshot
- managing PII data
- generating business reports
- managing errors, to name a few
Each of these has many considerations of its own.
For example, ingestion can happen from various sources such as cloud storage, streaming services (landed into cloud storage), a traditional data warehouse, APIs that send over feeds and so on. Transformation includes compression, boolean-to-string (or vice versa) conversions, lookups and the like. Data quality (popularly shortened to DQ) is always an interesting arena, because it lets you explore what quality thresholds to adhere to. If you are looking at fraud data, for example, your DQ needs to be at 100%, or as near to that as you can get. If you are looking at clickstream analysis, you may be able to live with 2–3% errors in your tables.
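To make the threshold idea concrete, here is a minimal sketch in Python (pandas). The dataset names, thresholds and the completeness check are all illustrative assumptions on my part, not a prescription; real DQ checks usually cover much more than null counts.

```python
import pandas as pd

# Illustrative thresholds only: fraud data tolerates (near) zero errors,
# while clickstream analysis can live with a few percent.
DQ_THRESHOLDS = {
    "fraud_transactions": 0.0,
    "clickstream_events": 0.03,
}

def error_rate(df: pd.DataFrame, required_columns: list) -> float:
    """Fraction of rows failing a basic completeness check (nulls in required columns)."""
    failing = df[required_columns].isna().any(axis=1)
    return float(failing.mean())

def passes_dq(dataset_name: str, df: pd.DataFrame, required_columns: list) -> bool:
    rate = error_rate(df, required_columns)
    threshold = DQ_THRESHOLDS[dataset_name]
    print(f"{dataset_name}: error rate {rate:.2%} vs threshold {threshold:.2%}")
    return rate <= threshold

# Example: a tiny clickstream sample with one incomplete row (a 20% error rate)
events = pd.DataFrame({
    "user_id": ["u1", "u2", None, "u4", "u5"],
    "page": ["/home", "/cart", "/home", "/pay", "/home"],
})
print(passes_dq("clickstream_events", events, ["user_id", "page"]))  # False: 20% > 3%
```

The BA does not write this check, of course, but knowing that thresholds get encoded somewhere like this helps you ask where they live and who owns them.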
As the data BA, you don’t need to know these figures (and they are not an exact science anyway). What is important is your ability to ask questions and validate whether all these aspects have been thought about. It is also important to understand the process thoroughly, to know who owns each step of it, and to know what SLAs or commitments the owners can be held accountable for.
The above was to give you a taste of the kind of work that goes on in a day in the life of a data BA, and to reveal some of the more subjective aspects of it.
Time to get to the best stuff in the blog!
Here is a cheat-sheet of ready-reckoner questions for the data BA:
- What is the purpose of collecting or analysing this data? What are the business questions this data will help us answer?
- What is the source of the data? How will the source send the data? At what frequency? What is the expected quality of the data at source?
- Is a single source of data enough to be useful? Do we need to “blend” in data from other sources or get it from third-party sources?
- Is it a go-forward ingestion, or do we need to ingest historical data too?
- What transformations does this data need to go through? What cleansing or validation would need to be in place?
- What compliance or regulatory requirements apply to this data? How would we honour those requirements when we receive requests to delete the data?
- How frequently and with what concurrency will the data be accessed, once we ingest it?
- What partitioning would we need to consider? What sorts of queries are most expected, to help guide the partitioning strategy? (See the sketch after this list.)
- What should our strategies be for managing governance and auditing?
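As an example of how the partitioning question plays out in practice, here is a minimal PySpark sketch. The bucket paths, the event_date column and the assumption that most consumer queries filter by a single day are all hypothetical, chosen just to illustrate the idea.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

# Hypothetical raw clickstream feed landed in cloud storage.
raw = spark.read.json("s3://example-bucket/raw/clickstream/")

# If most expected queries filter by date, partitioning the curated table
# by event_date means a "one day" query scans one partition, not the table.
(raw.write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("s3://example-bucket/curated/clickstream/"))

# A typical consumer query then prunes down to a single partition:
one_day = (spark.read.parquet("s3://example-bucket/curated/clickstream/")
           .where("event_date = '2024-01-15'"))
one_day.show()
```

The design point for the BA is the second question in that bullet: if consumers mostly query by user rather than by date, this layout would be the wrong choice, which is exactly why you ask about expected queries before the partitioning strategy is fixed.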
Each of the above is probably a blog in itself. I intend to bring out some of the nuances that are peculiar to each of those topics.
Another point: not all the questions above need to be asked of every stakeholder. It’s generally good practice to start by asking them within your engineering team and getting alignment there before taking them to the wider group.
If you don’t follow some of the topics mentioned in the questions, don’t worry. I cannot go into more detail here in the interest of length, but the jargon is easy to look up and understand. If you are on a data project, your engineers can surely help you, or even a simple Google search will do. The idea is to arm you with a list of ready-reckoner questions that gets the conversations going and covers all the important bases for data.
Please note that many modern tools like Snowflake or Databricks manage things like partitioning, governance and auditing out of the box, so some of the above questions would need a re-look in that case. But even then, asking them shows that you are aware of the needs, and the tool experts can answer by saying what works automatically and what needs work.
I hope the above list and outline help you get started, or accelerate, in your data BA journey. Did I miss any questions that I should have included? Were they too technical for a BA? Let me know in the comments!