How Can ‘Data Integrity’ Impact AI Product Performance?

Since the launch of ChatGPT sent the popularity of AI soaring, there has been a lot of chatter around the technology. People are asking philosophical questions about the future and AI’s impact on it, so it’s important to approach AI as ethically as possible. That starts with feeding quality, accurate data to the AI or machine learning (ML) systems being developed and used.

In fact, data scientists are often said to spend roughly 80% of their time analysing and testing their data so that it meets responsible AI guidelines before it is integrated with AI systems. Why is this? Why is the integrity of data so important to the AI and ML process? We explore here.

Why is data integrity important?

Data is the knowledge that you feed the machine learning system, so it needs to be accurate for the system to produce the correct outcome. That’s the biggest issue here: getting the desired outcome. A wrong outcome has a lot of knock-on effects. For one thing, it defeats the purpose of using artificial intelligence. It can also harm the business you are building the ML model for, affecting sales, operations and more.


Another issue is one that has affected the internet since its inception: misinformation. A big criticism of ChatGPT and its competitors, for example, is that they are informed by the internet, and the internet is already full of wrong data. If what the AI says is then taken as gospel, even more misinformation will be added to the internet – and, as we’ll go on to discuss, AI adds to it at a much larger scale.

Garbage in, Garbage out

This is all part of the GIGO principle: you put Garbage In, you get Garbage Out. The issue is that AI and machine learning models fed data with no integrity apply the GIGO principle at scale. If you put all your customers’ incomplete or wrong data into your ML model, you’re going to get false results that affect every business decision based on them.
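
To make the principle concrete, here is a minimal sketch in Python. The customer ages are made up for illustration, and a simple average stands in for a real ML model; the point is only that a single bad entry can distort a result derived from the whole dataset.

```python
from statistics import mean

# Hypothetical customer ages, invented for this example.
clean_ages = [34, 29, 41, 38]

# The same data with one mis-keyed entry (1000 instead of a real age).
dirty_ages = clean_ages + [1000]

clean_average = mean(clean_ages)   # 35.5
dirty_average = mean(dirty_ages)   # 228.4

print(f"Average age, clean data: {clean_average}")
print(f"Average age, one bad entry: {dirty_average}")
```

One wrong value out of five moves the average age from 35.5 to 228.4, and any decision built on that figure inherits the error.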

Even for these basic issues, data needs to be carefully checked before it is fed to the machine.

Types of data integrity

So, what types of data integrity issues are there? What is a data scientist actually looking for? Well, there are three main types.

– Missing value

A missing value means that a field is absent from the data. For example, your system takes in customer data and a field on the form, such as age, address or gender, wasn’t filled in.

– Range violation

A range violation means that the data supplied is out of bounds or is a known error. An example would be a customer entering their birth year as 1023 rather than 2023, or their birth month as 13.

– Type mismatch

A type mismatch simply means that data of the wrong type is entered, like putting numbers in the name field.
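
The three checks above can be sketched in a few lines of Python. This is an illustrative sketch only: the field names, the `SCHEMA` and `RANGES` tables, the 1900 lower bound for birth years and the `find_issues` helper are all assumptions made for the example, not part of any particular library.

```python
from datetime import date

# Hypothetical schema for a customer form: field name -> expected type.
SCHEMA = {"name": str, "birth_year": int, "birth_month": int}

# Hypothetical allowed ranges (inclusive) for the numeric fields.
RANGES = {
    "birth_year": (1900, date.today().year),
    "birth_month": (1, 12),
}


def has_expected_type(value, expected_type):
    """Return True if value looks like the expected type."""
    if isinstance(value, bool):  # bool is a subclass of int, so reject it explicitly
        return False
    if expected_type is str:
        # Treat a digits-only string as "numbers in the name field".
        return isinstance(value, str) and not value.strip().isdigit()
    return isinstance(value, expected_type)


def find_issues(record):
    """Return a list of (field, issue) pairs found in one record."""
    issues = []
    for field, expected_type in SCHEMA.items():
        value = record.get(field)
        if value is None or value == "":
            issues.append((field, "missing value"))
        elif not has_expected_type(value, expected_type):
            issues.append((field, "type mismatch"))
        elif field in RANGES:
            low, high = RANGES[field]
            if not low <= value <= high:
                issues.append((field, "range violation"))
    return issues


print(find_issues({"name": "Ada", "birth_year": 1023, "birth_month": 13}))
print(find_issues({"name": "12345", "birth_year": 1990}))
```

The first record reports a range violation for both the birth year and the birth month; the second reports a type mismatch for the name and a missing value for the birth month.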

These are very basic examples to illustrate the concept. You can handle issues like these by predicting missing values, setting default values, or even doing nothing: bad data might resolve itself once it surfaces upstream or downstream in the system.
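
As a rough sketch of those three options in Python, the hypothetical `fill_missing` helper below handles a missing field in one of three ways. Note the assumption: "predicting" is represented here by the median of the known values, which is only a simple stand-in for a real predictive model.

```python
from statistics import median


def fill_missing(records, field, strategy="median", default=None):
    """Return copies of records with missing values in `field` handled.

    strategy="median":  fill with the median of the known values
                        (a simple stand-in for predicting the value)
    strategy="default": fill with the supplied default value
    strategy="ignore":  do nothing and leave the gaps as they are
    """
    if strategy == "ignore":
        return [dict(r) for r in records]

    if strategy == "median":
        known = [r[field] for r in records if r.get(field) is not None]
        fill = median(known) if known else default
    elif strategy == "default":
        fill = default
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [
        {**r, field: fill} if r.get(field) is None else dict(r)
        for r in records
    ]


# Hypothetical customer records; the second one has no age.
customers = [{"age": 30}, {"age": None}, {"age": 40}]

print(fill_missing(customers, "age"))                        # gap filled with 35.0
print(fill_missing(customers, "age", "default", default=0))  # gap filled with 0
print(fill_missing(customers, "age", "ignore"))              # gap left as None
```

Which option is appropriate depends on how the field is used later, so the choice is left to the caller rather than fixed in the helper.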
