Squirro Podcast

Between the Spreadsheets - Data Classification

Written by Steven Grinberg | Aug 17, 2022 6:05:59 PM

With Susan Walsh

Susan Walsh is the Founder and Managing Director of The Classification Guru Limited, a specialist data classification, taxonomy customization, and data cleansing consultancy. Now, she is an industry thought leader. She's a TEDx speaker and she's the author of Between the Spreadsheets: Classifying and Fixing Dirty Data. She's also the founder of COAT, a methodology that her team uses to accurately and efficiently classify, normalize, cleanse, and check data for errors, which does help prevent costly data mistakes.

Full Transcript

Susan:
There is this single, singular kind of perception of the data space. And I want to change that because I buck all the trends. I don't do math at all, I failed it. I'm terrible at it. I love data. It's all text and pattern recognition that I work with when I'm working in my software. And I'm really passionate about it. It's fascinating.

But we don't always get to see that side. And that's fair enough, because there are genuinely a lot of introverted people in this space. So, I feel it's my responsibility on behalf of the people who don't want to speak up or can't, to say, "Look, this can be fun. It can be really interesting. You can get so much out of it. And data people are not boring." The best community that I have are the data people on LinkedIn.

Lauren:
Hi, everyone. My name is Lauren Hawker Zafer and this is Redefining AI. Redefining AI is a podcast hosted by Squirro and the Squirro Academy. The podcast focuses on key tips and discussions that drive digital innovation and help people understand artificial intelligence, machine learning, data analytics, and no-code AI.

Today, I've been joined by Susan Walsh. I'm enthusiastic about the discussion that lies ahead. Susan is the Founder and Managing Director of The Classification Guru Limited, a specialist data classification, taxonomy customization, and data cleansing consultancy. Now, she is an industry thought leader. She's a TEDx speaker. And she's an author of Between the Spreadsheets: Classifying and Fixing Dirty Data. She's also the founder of COAT, a methodology that her team uses to accurately and efficiently classify, normalize, cleanse, and check data for errors, which does help prevent costly data mistakes.

Now today, I'll be talking to Susan about data, dirty data in particular, and why she is so passionately engaged in a mission to fix data and corporate data landscapes. But just before I hand the mic over to Susan, I want to start by setting the scene a little.

If you've ever analyzed data, you can empathize with the pain of exploring your data only to find that it is dirty, poorly structured, full of inaccuracies, siloed, or perhaps incomplete. Slower processes that are affected by the state of data can ultimately lead to missed opportunities and lost revenue. With Gartner 2021 research indicating that the average financial impact of poor data quality on organizations is around 9.7 million per year, enterprises are, of course, taking steps to overcome dirty data. And someone who can really help us find out more about those steps and the core qualities of dirty data is Susan. Welcome, Susan.

Susan:
Thanks, Lauren. Great to be speaking to you, fellow Scot.

Lauren:
Yeah, exactly. It's wonderful to be able to welcome a fellow Scot.

Susan:
No one can say "squirrel" like we do.

Lauren:
That is true, Squirrel. And I'm hoping that everyone can understand us. If they don't understand the conversation, they are at least going to get a taste of authentic Scottish accents. Are you in Scotland at the moment, Susan?

Susan:
No. I live outside of London now and have for 20 years. My accent is very toned down and I've slowed it down. But, you know, speaking to other Scots, I do kind of speed up. So, I'll be very careful.

Lauren:
And so, this is a first, it's another first, a Classification Guru. I mean, in Series 1 of Redefining AI, I've had the pleasure to talk to an evangelist, an astrophysicist, but never a guru, which is as we know, by definition, a Sanskrit term for a mentor, guide, expert, or master of certain knowledge or a field. So, given this Susan, what is a classification guru? And who is in need of one?

Susan:
Well, I would say it's a self-appointed, self-claimed role. But when I left the spend analytics company that I was working for and had been for five years, and needed a business name, I didn't want, you know, Susan Walsh Consulting or Walsh Consulting or Data something. I wanted something that explained what I did in the business name. Because as part of my job, I classify company names. And the number of company names that have no relevance to what they actually do is quite staggering. So, I wanted to make sure that it was related to the services that I was going to provide.

And the kind of people who need a classification guru, initially, were just in the procurement space, looking to get their spend data classified and their suppliers normalized, so they could see how much they were spending with each supplier across the globe and in what categories. However, that has now expanded into CRM systems, changing ERP systems. I do any kind of cleansing of databases, as well as categorization of spend data and, soon, retail category management marketing data too. So, what we do can be applied to all different areas of data.

Lauren:
Okay, I mean, that sounds like a lot of data flow, along with the classification of lots of different data systems. Why has it spread from, as you mentioned, procurement into so many different data systems?

Susan:
I think the tagline, fixer of dirty data, really caught the attention of a lot of people, outside of just the procurement world. And people will now come to me and say, "I'm not even sure if you can help me with this. But we've got this problem, do you think you can help?" And I'll either say, "I've never done it before, but I think I can, let's give it a go." Or I'll say, "No, I'm sorry, I can't do that. You'll want to talk to someone else." And that's where it started from.

And the other thing is, there are just so many data problems out there in every single area of business. Everybody needs help. This is not a siloed problem. This is across the board.

Lauren:
Okay. Yeah, I mean, I just recently read that Andrew Ng was stressing on LinkedIn that data quality is really something that needs to be looked at, and that it's an approach that should be taken to build foundations for successful AI deployment in particular. But we really need to be looking at the core fundamentals of data.

Susan:
It all starts with this, it really does.

Lauren:
Before we go into looking at sort of dirty data, I just want to go back to one word. So, classification itself, what is classification? Because I imagine that it can be interpreted in quite a few different ways.

Susan:
Yeah, interestingly, when I set up the business, I found out there was a different kind of classification to the classification I knew. My classification is taking a supplier name, and it could be something like Dell, and then, just like a bank statement, there'll be a whole load of lines from the invoice of things that have been purchased by that company. And it will look like keyboards, a mouse, a monitor, some cables, a laptop, maybe a tablet. And either using an existing taxonomy or building a customized taxonomy, we might say level one is going to be IT. Level two is going to be hardware. And then, level three will be either laptops, accessories, or peripherals. And you could actually go into a further level of detail beyond that, depending on who the client is.

What that means is, when we roll up all our categorized classified data, as a company we know, we spent x-million with IT this year, x-million with professional services, x-million with facilities or travel, etc. And if we want to dig deeper in travel, maybe we want to know how many airplane tickets we bought this year, or hotel stays, we can then go into that level of detail and start to negotiate better rates with suppliers, like on a global basis. Like hotels, you know, you could have a global rate with certain hotel chains for all your staff rather than paying hugely varying rates in different countries for basically the same thing.
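Editor's note: the roll-up Susan describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not her tooling: the taxonomy entries, supplier names, amounts, and the function names `classify` and `roll_up` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical three-level taxonomy, keyed by invoice line description.
TAXONOMY = {
    "laptop": ("IT", "Hardware", "Laptops"),
    "keyboard": ("IT", "Hardware", "Peripherals"),
    "mouse": ("IT", "Hardware", "Peripherals"),
    "monitor": ("IT", "Hardware", "Peripherals"),
    "cables": ("IT", "Hardware", "Accessories"),
    "hotel stay": ("Travel", "Accommodation", "Hotels"),
    "airplane ticket": ("Travel", "Air", "Tickets"),
}

# Made-up invoice lines: (supplier, description, amount).
INVOICE_LINES = [
    ("Dell", "laptop", 1200.0),
    ("Dell", "keyboard", 40.0),
    ("Dell", "cables", 15.0),
    ("Hotel Chain A", "hotel stay", 300.0),
    ("Airline B", "airplane ticket", 450.0),
]


def classify(description):
    """Return the (level 1, level 2, level 3) categories for a line."""
    key = description.strip().lower()
    return TAXONOMY.get(key, ("Unclassified",) * 3)


def roll_up(lines, level):
    """Sum spend per category, truncated to the requested taxonomy level."""
    totals = defaultdict(float)
    for _supplier, description, amount in lines:
        category = classify(description)[:level]
        totals[category] += amount
    return dict(totals)


level_one = roll_up(INVOICE_LINES, 1)    # spend per top-level category
level_three = roll_up(INVOICE_LINES, 3)  # drill-down to the finest level
```

Rolling up at level one answers "how much did we spend with IT or travel", while level three supports the drill-down into, say, hotel stays.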

Lauren:
Okay, interesting, great. It's good to keep a sort of an alignment with what you would define as classification and what it is.

Susan:
Yeah. And the other side of that, the classification that I don't do, is file hierarchy security classification. So, you know, sensitive, confidential, etc. That scares me. That's not my thing.

Lauren:
Okay, so the word data itself can also be sort of classified or modified by adjectives. And many have become relatively fond of speaking about it and defining it in detail using the word, or the adjective, dirty. Now, and maybe following two trails of thought here, I would start by asking you, what is dirty data? And, as we've touched upon it very, very briefly, why is it being talked about so often at present?

Susan:
So, I mean, I've got about eight different types of dirty data that I talk about. And that's things like missing information, typos, information in the wrong columns, you see that quite a lot, where you have a city and the zip code or the postcode or the town in the address level one. There can be different units of measurement, you know, different countries are measuring things different ways, kilometers, meters, etc. Formatting, particularly dates is a huge problem across the globe. You can't get documents to match up because the dates are different in different countries. We've got duplicates. We can't forget our old friends, the duplicates. I think that's six. I can't remember what I've done with the other two.
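Editor's note: a few of the dirty data types listed above (missing information, duplicates, inconsistent date formats, values in the wrong column) can be flagged with simple checks. The sketch below is illustrative only; the records, field names, the `find_issues` function, and the simplified UK postcode pattern are assumptions for the example and would need adapting to real data.

```python
import re
from collections import Counter

# Made-up records illustrating some of the problems described above.
RECORDS = [
    {"supplier": "Dell", "city": "London", "invoice_date": "2022-08-17"},
    {"supplier": "Dell", "city": "London", "invoice_date": "2022-08-17"},
    {"supplier": "Acme", "city": "", "invoice_date": "17/08/2022"},
    {"supplier": "Globex", "city": "EH1 1YZ", "invoice_date": "2022-08-18"},
]

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
# Simplified UK postcode shape; an assumption, not a full validator.
UK_POSTCODE = re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}")


def find_issues(records):
    """Return (row index, issue) pairs for a few common dirty-data types."""
    issues = []
    counts = Counter(tuple(sorted(r.items())) for r in records)
    already_seen = set()
    for index, record in enumerate(records):
        key = tuple(sorted(record.items()))
        # Flag the second and later copies of an identical record.
        if counts[key] > 1:
            if key in already_seen:
                issues.append((index, "duplicate"))
            already_seen.add(key)
        # Flag empty fields.
        for field, value in record.items():
            if not value.strip():
                issues.append((index, f"missing {field}"))
        # Flag dates that are not in the one agreed format.
        if not ISO_DATE.fullmatch(record["invoice_date"]):
            issues.append((index, "non-ISO date"))
        # Flag a postcode that has ended up in the city column.
        if UK_POSTCODE.fullmatch(record["city"].upper()):
            issues.append((index, "postcode in city column"))
    return issues


issues = find_issues(RECORDS)
```

Checks like these only surface candidates for review; deciding what the correct value should be still takes someone who knows the data.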

Lauren:
Outdated?

Susan:
Outdated, could be one. Or misleading data. Missing data, I think we've talked about. Anything that's not right. And again, depending on the area of data that you work in, because it's like data is the world. Within it, there's like lots of different countries of different types of data. There will be specific data problems to certain people. But those are the common ones, I would say.

Lauren:
So, why are people talking about this so much at the moment then?

Susan:
There's been a huge focus on data, particularly since everybody had to work from home in lockdown. And businesses really started-- From my perspective anyway, businesses really started to have a look at what they were spending the money on and where, and that's what has really shone a light on this. Or they found that they didn't have the right CRM system to cope with working from home in a different environment. Or they needed to update categorization of all their products that they keep internally because not everyone is in the office now to use it. So, there has been a lot more scrutiny.

And I think data is the buzzword right now. It's where all the jobs are. Highly in demand. In fact, there is a skill shortage in the area. Everybody is looking at it. And there is more data than ever. And that's something else, it's not just about looking at the data. But you need to be looking at the right data, all the data.

Lauren:
And do you feel that with this maybe overflowing amount, I mean, there is a necessity to point towards the fact that there is too much detail? I mean, we at Squirro as well, we look at the siloed experience of all data that is structured together, and obviously bringing adequate insights to it.

With the too much data, I mean, you've spoken about other types of dirty data and the issues that it could possibly lead to in organizations. Is there one particular dirty data of the eight you have mentioned that leads to more complicated issues in corporations?

Susan:
I would say, inconsistent data is a real problem. And when you've got silos, that's an even bigger problem. You can have the same product, for example, in a couple of different data sets. And within that, it can be tagged, labelled, categorized as different things, even though it's the same thing. And this happens all the time in business. Different countries call the same thing something different.

I've just actually been working on a taxonomy, a global one, for a food company. And we've had the whole, is it chips or fries discussion? What are we going to call it in the taxonomy because most markets call it fries. But the U.K. calls it chips. We went with fries. But you have to know to look for fries. I'm sorry, I'm going off on a tangent there.
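Editor's note: the chips-or-fries decision amounts to mapping each market's local term onto one preferred taxonomy term. A minimal sketch, assuming a hypothetical per-market synonym table and a made-up `normalize_label` helper (neither comes from the project Susan mentions):

```python
# Hypothetical synonym table: local term -> preferred taxonomy term, per market.
# The mapping is per market because the same word can mean different
# products in different countries.
SYNONYMS_BY_MARKET = {
    "UK": {"chips": "fries"},
}


def normalize_label(label, market):
    """Map a local product label onto the agreed taxonomy term."""
    # Collapse whitespace and case so "Chips " and "chips" are treated alike.
    cleaned = " ".join(label.strip().lower().split())
    return SYNONYMS_BY_MARKET.get(market, {}).get(cleaned, cleaned)


uk_label = normalize_label("Chips ", "UK")
us_label = normalize_label("fries", "US")
```

Keeping the table keyed by market means a term is only rewritten where it is known to be a local synonym; elsewhere the label passes through unchanged.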

Lauren:
No, but I think it's a good question as well, because obviously inconsistencies you have to be able to identify what is an inconsistency? Is it about the ambiguity or maybe as you've highlighted upon there like a semantic understanding or a nuance that aligns with the use of language? I mean, how do corporations even identify? Or does it require someone like yourself to be able to say, "Well, look, here are inconsistencies. And that's why we need a taxonomy, or this is why you need someone to create more structure in the data."

Susan:
Ultimately, regardless of the type of data in a silo, whatever it is, you don't have a full picture of what's going on in your industry, your project, your department, your business. And that's a problem. You could have two people replicating the same project in two different areas, because they don't know that each other are working on the same thing. Or even consistency of processes, you know. Are there two people in two different regions doing things differently and one is taking longer than the other? And it's about efficiencies and having all that data in one place. Instead of doing something five times across five silos, you do it once across one big silo. And it takes 1/3 of the time. It's about that as well. And that increases profitability. And we always talk about saving time, saving money. But that's not really catching the eye of the decision makers and businesses who need to invest in their data. But telling them you are going to increase profitability, well that's normally a KPI, that's going to make them sit up and think a little bit more.

Lauren:
And I understand as well, Susan, that you've come up with your own methodology. And a lot of organizations, they incorporate methodologies to ensure that there is the completeness, the validity, the consistency, and the correctness of the data. Does your methodology align with that? Or what did you develop the methodology for in particular?

Susan:
The reason that I developed it was because there's a lot of people working with data who are not data professionals. And they're really intimidated by the terminology or a spreadsheet. It's very scary for them. And I wanted to come up with something that they could relate to, and they could understand, and they can work with, because you know, all the tools out there, including myself and all the services, we are fixing an existing problem. But what we really need to try and do is fix the problem at source, which is the inputting of the data. And so, I say make sure your data has its coat on, as in a jacket.

So, yeah. First of all-- Make sure that it is consistent. And that means terminology, you know, units of measure, processes, how we do things, make sure that everybody is doing it the same way. And also, from a classification point of view, if you classify everything as A, and you need to change it to B, because it's all classified as A, it's super easy to change. If you have classified it as a mixture of A and B, then it's going to be a real problem trying to classify it all to B.

And then, of course, it has to be organized too. So, you know, we all have wardrobes with clothes in there that we should probably put on a nice hanger and maybe just threw in there when we were tired one night. Data is very much the same as that. If you organize your wardrobes, all your tops together, jackets, dresses, and maybe even by color if you are like me and like everything in order, it is so easy to go in and just pull out what you need. And data is exactly the same as that. So, whether it's categorization, whether it is labeling things by region, by business unit, by person, however you need your data, have it tagged and labelled. So, when someone asks you, "Okay, well how much did we sell in this region? Or how much did we buy in this company?" You can just pull out that information in five minutes, rather than spending three days having to stick a whole bunch of spreadsheets together making sure the numbers all add up, and x, y, z.

And as accurate as it can be, let's be honest, there is no such thing as a perfect data set. It's not going to stay like that for very long. Let's just get it as accurate as it can be. And then, once you have your consistency, your organization, and your accurate data, it's trustworthy. So, that means when you have to send reports to senior managers, you know that information is correct, that they are making the best business decisions that they can, that you are not going to be embarrassed by getting called up on numbers that don't add up on a chart, or they picked something up that you haven't even seen. Things like that. Just even one small mistake, you know, like Dell being classified as a hotel. If anyone sees that on a chart, they are going to suddenly question all the data and not trust any of it, even if that is the only single mistake in the data set. So, it's, you know, hopefully by putting those processes in place, from the start of the process of inputting the data, all the way along to the point where the charts are ready, or whatever it is you're doing, it could be AI, making sure it's all ready before it goes in. And then, when it comes out, making sure it's still the same. By doing that, it's going to make everyone's lives a lot easier, I think.

Lauren:
I mean, I suppose we could go back to maybe a key or core question. I mean, you said that part of what you're doing, you're addressing the source. You're addressing maybe an older problem, like going back to the fundament, or you're highlighting at the end that you have this consistency, the organization, the accuracy, which then leads it to become trustworthy. How difficult is this in implementation then, if you are looking at maybe a global scale or global organization where you envision multiple people? Is this part of the problem at the core? Or part of the solution? I mean, how does that evolve, if we are looking at implementation a little bit more on a larger scale?

Susan:
I think that it should be done in small chunks, small steps at a time. So, let's start with one department, test that out, see how it works. There is also definitely a change management piece around that. Because everybody accepts change at a different pace. Some people will be happy to adopt new technology straightaway, super excited to learn it. Other people will want to hold on to those spreadsheets and say, "No, and you can't have them, they are mine." And so, you have to give everybody time to adjust and come around.

And I think if you try to do everything at once, then ultimately it will fail. I think if you plan and know what you want your outcomes to be before you start, then it will make the process a lot easier. And let's be realistic, it's not going to be smooth sailing. And it could be a year or so like changing mindset, etc., particularly older generations. You know, younger generations are used to a new technology popping up every other month, whereas the older generation aren't. So, it's going to take them longer to see the benefits of it.

Lauren:
Very much so. And I think that probably a large part of what you're doing as well does involve the support that is required to ensure that there is this positive promotion.

Susan:
Yeah.

Lauren:
Of what's changing, what's happening is definitely going to be more beneficial in the long run as well.

You mentioned sort of a prominent word as well, spreadsheets, and people holding on to spreadsheets. And yeah, I mean, we've titled it Between the Spreadsheets--

Susan:
Yes.

Lauren:
--Data and Classification. And it's a term that is a word play, of course, on the well-known idiom that we as native English speakers know, which is between the sheets. And this is also the title of your first book, not Between the Sheets, but Between the Spreadsheets. Now, dirty data and messages of data, they're similarly two key terms that are directly linked to your professional profile and brand identity. I mean, the analogies that are used here are unique and very original, Susan, especially in the field. And I've personally seen very few individuals or organizations who are successfully, and I highlight that word, because you're doing it very successfully, which is admirable, profiting from sexualizing data and the processes associated with it. So, what inspired you to take this approach to your profession? And what do you want to motion with this?

Susan:
I think that's the best question that I've ever been asked. And honestly, I never intended or started out with the intention of this happening. When I started my business, I had no connections in the industry, whatsoever. Not in procurement, not in data, not in business for a start. And I realized very quickly that although there was a need for my services, nobody was looking for me. Because they didn't know I existed. And calling 100 people a day wasn't really going to do much for my profile. And so, I started doing a lot of activity on LinkedIn. And it took years. And eventually the title, Fixer of Dirty Data, came up. And I guess, I was surprised at how welcoming people were to that title and could see the play on words. And I think, you know, everything I do, I do it for me. Obviously, I want to grow my business, but I do it because I enjoy it. And I own it.

And I honestly thought that-- I didn't think that it would be as accepted as it has been. But at the same time, there is this singular kind of perception of the data space. And I want to change that, because I buck all the trends. I don't do maths at all, I failed it, I'm terrible at it. I love data. It's all text and pattern recognition that I work with when I'm working in my software. I'm really passionate about it. It's fascinating. But we don't always get to see that side. And that's fair enough, because there are genuinely a lot of introverted people in this space. So, I feel it's my responsibility on behalf of the people who don't want to speak up or can't, to say, "Look, it can be fun. It can be really interesting. You can get so much out of it. And data people are not boring." The best community I have are the data people on LinkedIn. You know, they have a wicked sense of humor. I think because it's me, I can get away with it. They know me, because they've grown with me over the last few years, as my business has grown, and my profile has grown.

And again, I do this mainly for the non-data people. That is who I'm targeting. I want them to see the fun side of data and just to warm to it. And so, that's why I did the book title. I think it was originally going to be called, Dirty Data, because that was my name. But then I saw an article, I think, in The Economist, called between the spreadsheets, and I was like, "I need that title." I felt it was really important to-- I want people to look at the book and pick it up just because of the title. You know, there's loads of data books out there, how to do this, how to do that. But I want data and non-data people to pick that up and read it and go, "Oh, wow, I got something out of that."

Lauren:
I mean, that's a really interesting story. And I think that, first of all, it's credible, that you've decided that maybe a standard approach to cold call them, like set and trying to build a business slowly, and it does take a lot of effort. I mean, those of us who are listening and those of us who have been part of a small net startup community as well, we understand that there is a lot of graft that goes into building a brand. It's once again, something you are doing very well.

You've said that you were maybe scared-- Well, not scared but concerned about the reaction that you might receive at following this sort of path, the way that you're embracing it. Have you had any sort of negative experience or a point in your journey that made you maybe question or want to turn back?

Susan:
I think there was one experience that I talked about in my TED talk where I was uninvited from a webinar. And that's been the only thing. I've not been trolled at all on LinkedIn, which is huge.

But I think the other thing, probably to say, that we haven't talked about, is I'm a female in data and a female business owner. It does make it harder to open doors. Nobody is going to open those doors, I'll just have to kick them down. Or come in with a megaphone and go, "Hello, I'm here." That's been my approach. Don't be polite and make way for me. I'm just going to barge in and let everyone know, "Hello, I've arrived. I am here." So, they have to deal with me.

Lauren:
So, in saying that, do you think that it's more difficult for females in data? And that more space should be created for females in data?

Susan:
I think you have to create your own space. I would hate to be anywhere because I'm female or anything. I want to be where I am, because I'm really good at what I do. But I also understand that I might have to shout a bit louder than the men to get heard. Because if I just do the same as the men, nobody is going to notice me. But if I do something a little bit different, it's a huge gamble. But for me, it's paid off.

Lauren:
I mean, it has paid off. If we look at success, and I mean, success is very subjective, both in terms of roads travelled to success and what it means to an individual. To every individual, success means something else. And when we talk about maybe the concepts and the underpinnings with someone like yourself, and what you've just highlighted, you are an inspiring and ambitious female in the data space. You have published a book. You're also a TEDx speaker, what you've just mentioned. And a blossoming sort of business owner.

What would then constitute success, an ongoing success in the future, to yourself?

Susan:
It's funny, because if you'd asked me this in my 20s, I would have said, it is money, and position, and job title. And it's none of those things now. For me, it's being content, being fulfilled. And just enjoying every day. Like, I genuinely love what I do. And that's the most important thing. As soon as I don't love this anymore, I won't do it.

Lauren:
And do you love it because you're contributing to helping people solve a real recognized problem? Or do you love it because, as we've just talked about, the approach that you're taking, the originality that you're bringing, the sort of mold that you're breaking, you're shaping a new perception of what can be done with data? And I would say you're also encouraging people to come into the data spaces, both females and people that are maybe not that interested in data.

Susan:
Yeah. I genuinely love the difference that I can make for my clients. But, unfortunately, most of them I can't talk about because they do not want to admit that they have dirty data. So, how I get around that is by promoting the business in other ways. I love the other stuff that I do just as much, like I really do enjoy that. I have fun with it. And that is important to enjoy and have fun. We spend so much time at work, so get something out of it.

Lauren:
In getting something out of it, could you maybe pass on some exclusive piece of information or value that would help our audience?

Susan:
I think in general people don't have enough belief in themselves. So, if you don't believe in yourself, nobody else is going to. Whether that's going for a job, whether that's starting a business, whether that's growing a business, whether it's changing careers, you have to believe in yourself first amongst everything else and be yourself too. Like so many, I think it's changing now, but when I was growing up and started in the corporate world, you had to be a certain way and act a certain way. And that really didn't fit with me. You know it's okay to be yourself. That's the secret sauce.

Lauren:
That's the secret sauce. So, yeah, hopefully we've been able to not only inform people about classification and the importance of it, and also enlighten them a little bit more about dirty data, the types of data, and looking at maybe methodologies that certainly help it, and keeping it consistent, organized, and accurate so that they can build up this trustworthiness. Hopefully we have provided a sort of insight into the importance of originality in a crowded space, I would say.

Susan:
Yeah. There is plenty of room for everybody.

Lauren:
Is there anything that you would like to share in the final words as we move to the close of our session, Susan?

Susan:
Yeah, just one more thing. Maintain your data. It's not good enough to just fix it or clean it. You have to regularly maintain it to keep it that way or there is no point in doing it in the first place.

Lauren:
Good advice. Okay.

Susan:
Nothing profound or, you know, thought provoking. Just please, maintain your data.

Lauren:
Maintain your data, people. Okay. Thank you, Susan.

Susan:
Thanks so much.

Lauren:
Thanks. So, as we close this podcast conversation today, I want to take the opportunity to thank you, the audience as well, for sharing such a valuable experience with us. And if you want to find out more about Squirro and the Insight Engine, then go to the Squirro Academy on Learn.Squirro.com and access our educational material.

Thank you.