What is Probabilistic and Deterministic data?

Deterministic data: Information about people that is known for sure

Deterministic data consists of digital facts about people that we trust to be 100% true. Crucially, a fact that was true when it was recorded does not change, and we treat the probability that it is true as 100%, so these facts provide a solid foundation for many applications in online marketing. For example, if we know from a reliable source that a person was a 20-year-old female last year, then that will always be true. We can even deduce that this year the person is a 21-year-old female. Knowing a person’s true age and gender is certainly of high relevance to online marketeers. Going beyond basic demographic information, deterministic data can take many forms, such as a person’s interests, friends, geographical whereabouts etc. In practice, all these facts are linked to something that identifies a person, such as an email address or a cookie ID, which then becomes the real lingua franca of the online marketing industry.

Why is it important to have deterministic data? In a nutshell, deterministic data forms a “ground truth” about users that is both useful on its own and has many important downstream applications in online marketing. On its own, we can use deterministic data to create granular custom segments. For example, we can create a segment of people who we know share an interest in golf. We could then target these golf enthusiasts with relevant online campaigns. The more deterministic data we have, the larger the segments we can create.

Another use case for deterministic data is campaign validation. Let’s look at this use case in more detail. After a campaign has run its course, online marketeers may ask themselves whether the campaign was successful. Was it able to reach its intended audience? What was the ratio of hits to misses? How did the campaign perform with respect to the target group on individual publisher websites? All these questions can be answered if we have deterministic data for a sufficiently large subset of the exposed users.
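The validation questions above can be illustrated with a minimal Python sketch. Everything in it is made up for illustration: the panel, the exposure log, the publisher names and the target group (women aged 18 to 34) are assumptions, not real data or a description of any actual product. The sketch counts hits and misses per publisher, using only the exposed users for whom deterministic data is available.

```python
from collections import defaultdict

# Hypothetical deterministic panel: user ID -> known attributes.
panel = {
    "u1": {"gender": "female", "age": 21},
    "u2": {"gender": "male", "age": 34},
    "u3": {"gender": "female", "age": 45},
    "u4": {"gender": "female", "age": 29},
}

# Hypothetical campaign exposure log: (user ID, publisher site).
exposures = [
    ("u1", "news.example"),
    ("u2", "news.example"),
    ("u3", "sport.example"),
    ("u4", "sport.example"),
    ("u9", "sport.example"),  # not in the panel, so it cannot be validated
]


def in_target(attrs):
    """Assumed target group for this illustration: women aged 18 to 34."""
    return attrs["gender"] == "female" and 18 <= attrs["age"] <= 34


def on_target_counts(exposures, panel):
    """Return {publisher: (hits, validated exposures)} for panel users only."""
    counts = defaultdict(lambda: [0, 0])
    for user_id, publisher in exposures:
        attrs = panel.get(user_id)
        if attrs is None:
            continue  # no ground truth for this user
        counts[publisher][1] += 1
        if in_target(attrs):
            counts[publisher][0] += 1
    return {pub: (hits, total) for pub, (hits, total) in counts.items()}


rates = on_target_counts(exposures, panel)
for publisher, (hits, total) in sorted(rates.items()):
    print(f"{publisher}: {hits}/{total} validated exposures on target")
```

Note that the user without panel data is simply skipped, which is why the subset of exposed users with deterministic data needs to be sufficiently large for the result to be meaningful.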

Finally, prediction is another important use case for deterministic data. Prediction involves making educated guesses about a user property that we do not know from our deterministic data. For example, we might try to guess the age, gender or interests of a user in order to create probabilistic segments. Prediction is useful and necessary, but it is also a source of inaccuracy. The more deterministic data (stuff you know) you have as a training set for your algorithms, the better the combination of accuracy and reach you can theoretically achieve, and the more impressions you will deliver on target. After you train a probabilistic model, you also need to validate whether the model was successful or requires more tweaking. In other words, you can have all the behavioural data the internet has to offer, but without a solid base of deterministic data you are unlikely to deliver precision in your predictions. Many publishers will recognise this with some disappointment, having experienced how their data products or partners were unable to help their business in the way they expected. Without a large volume of deterministic data to validate your model against, you are flying blind. This is why trying to predict audience segments based on behavioural data alone, or on small pools of first-party user data (e.g. 1,000 user surveys), makes it very hard to generate reach without compromising on precision.
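The trade-off between precision and reach can be sketched in a few lines of Python. The scores and labels below are invented for illustration, and the threshold-based segment rule is an assumption rather than a description of any particular product. Each entry pairs a model's predicted probability that a user is female with the truth known from deterministic data.

```python
# Hypothetical validation set: (predicted probability of "female", known truth).
validation = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.65, True), (0.55, False), (0.40, True), (0.20, False),
]


def precision_and_reach(validation, threshold):
    """Put users scoring at or above the threshold in the segment.

    Precision is the share of selected users who truly belong to the segment;
    reach is the share of all validated users who were selected.
    """
    selected = [truth for score, truth in validation if score >= threshold]
    if not selected:
        return None, 0.0
    precision = sum(selected) / len(selected)
    reach = len(selected) / len(validation)
    return precision, reach


for threshold in (0.9, 0.5):
    precision, reach = precision_and_reach(validation, threshold)
    print(f"threshold {threshold}: precision {precision:.2f}, reach {reach:.2f}")
```

In this toy data, lowering the threshold enlarges the segment but lets in more users who do not belong to it. With only a handful of labelled users, both numbers would be too noisy to trust, which is the point made above about small pools of first-party data.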

You may ask yourself where all this deterministic data comes from. The answer is that deterministic data comes from a multitude of sources, which include online questionnaires, e-commerce sites, and social media. For example, websites frequently ask their users to fill out questionnaires with details about their satisfaction level along with demographic information. E-commerce sites collect facts about people over time, such as the items they have bought and their shipping details. Social media encourage people to share facts, i.e. deterministic data, about themselves, such as their interests, employment history, and education level. All this data flows into a pipeline of deterministic data that is exchanged between different platforms on the internet, either directly or via services that are derived from the data. Crucially, we must remain critical of the sources from which deterministic data is gathered, since we promote this data to the level of digital facts about people, with big consequences for targeting, campaign validation, and algorithmic segment creation.

In conclusion, deterministic data forms the valuable “ground truth” about the online population on which all other applications in online marketing are based, unless we are willing to guess at random. While deterministic data offers value on its own, e.g. as the basis for granular custom segments, it also forms the foundation for applications such as campaign validation and probabilistic segments, which potentially offer much bigger reach than deterministic segments. We gather deterministic data from a multitude of reliable online platforms that range from e-commerce sites to social media and questionnaires. We help publishers and agencies validate campaigns, create custom segments and predict with precision by providing high-quality deterministic data panels.

Probabilistic data: Information about people derived from mathematical models

Probabilistic audience data is usually based on behavioural data, such as web logs, that is aggregated and analysed in order to determine the probability that a user belongs to a certain demographic category or class. Advanced algorithms try to identify distinct behavioural patterns, such as certain travel and browsing behaviours, in order to determine the probability of the user being male or female, young or old, etc. Many behavioural models are in fact searching for distinct patterns of known human behaviour, patterns that usually emerge because humans are creatures of habit.

  • Some audiences are more likely to consume sports and motoring news
  • Some audiences are more likely to be online at certain times of the day/week
  • Some audiences own and use certain types of devices

All these habits create distinct behavioural patterns that can often be identified algorithmically in anonymised log files. The advantage of probabilistic modelling is the ability to scale your models, since you no longer have to rely on first-party interactions and on people providing you with their profile information and login details such as usernames and e-mail addresses. As long as the correct permissions are obtained, a user does not need to log in and provide you with personal data before their online behaviour can be observed, logged and algorithmically matched to a specific demographic target group.
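To make the idea of matching behaviour to a demographic group concrete, here is a minimal sketch that applies Bayes’ rule to a single behavioural signal. The counts are invented, the signal name reads_motor_news is hypothetical, and a real model would combine many signals; this is an illustration of the principle, not a description of how any particular vendor’s model works.

```python
# Hypothetical counts from users whose gender is known.
labelled_counts = {
    "male": {"users": 400, "reads_motor_news": 240},
    "female": {"users": 600, "reads_motor_news": 120},
}


def posterior_given_behaviour(labelled_counts, behaviour):
    """Return P(class | behaviour observed) for each class via Bayes' rule."""
    total_users = sum(group["users"] for group in labelled_counts.values())
    joint = {}
    for label, group in labelled_counts.items():
        prior = group["users"] / total_users           # P(class)
        likelihood = group[behaviour] / group["users"]  # P(behaviour | class)
        joint[label] = prior * likelihood
    evidence = sum(joint.values())                      # P(behaviour)
    return {label: p / evidence for label, p in joint.items()}


posterior = posterior_given_behaviour(labelled_counts, "reads_motor_news")
for label, probability in sorted(posterior.items()):
    print(f"P({label} | reads motor news) = {probability:.2f}")
```

The result is a probability rather than a fact: an anonymous user who reads motoring news is more likely to be male in this toy data, but the model can still be wrong for any individual user.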

While the true strength of the probabilistic approach lies in its ability to scale, its inherent weakness is often a lack of deterministic data with which to validate the accuracy of the model. The question is: How do we know that our model is right? The answer is: We can validate predicted profiles if we have “ground truth” for a sufficient subset of them. For this reason, deterministic and probabilistic data are complementary.

Probabilistic modelling does not operate in absolutes, but provides classification with a degree of certainty. In other words, validation is needed in order to document the effectiveness of any probabilistically derived audience. This is why AudienceProject has chosen to deploy a combined approach in which behavioural modelling is used to classify anonymous users into demographic classes, while the deterministic data is used to test the accuracy and precision of the models and to improve them iteratively. This approach gives us the benefit of high accuracy levels combined with massive scale.

Read more in FAQ

At helpdesk.audiencedata.com you can read our FAQ and learn how to set up and use AudienceData. If you find that something is not covered in the FAQ, please reach out to us at support@audiencedata.com. We’ll be happy to answer any question you might have.
