Models. Representation. Understanding.
Understanding “understanding”.
Lecture notes from the Computational Thinking course at IBA, Karachi. Week VI.
Before we dissect how machines and models understand the world, we take a deeper look at how we understand something. This is our second and final look at representation, understanding and model building.
Representation?
In earlier lectures we defined representation as an approximate or crude model. Representation is a tool we use to simplify, document, encode and understand a specific process, a set of rules, or relevant context. For instance, a scrabble board is a 15 by 15 grid. A chess board is an 8 by 8 grid. A dictionary is a list of words. A searchable dictionary of words is more efficiently stored and accessed within a DAWG (Directed Acyclic Word Graph). A board evaluation algorithm can live in an Excel spreadsheet. Rules of a game can be summarized in a truth table for easy reference, simplification, or reduction to a more compact form. These are all instances of representations that we used in the last six weeks.
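As a small illustration of the dictionary example, here is a sketch of a trie, the simpler cousin of a DAWG (a DAWG additionally merges shared suffixes to save space). The class and function names are ours, purely illustrative, not part of the course material.

```python
# Minimal trie sketch: a DAWG is essentially a trie with shared suffixes merged.
class TrieNode:
    def __init__(self):
        self.children = {}      # letter -> TrieNode
        self.is_word = False    # marks the end of a valid word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word

root = TrieNode()
for w in ["cat", "cats", "car"]:
    insert(root, w)
print(lookup(root, "cats"))   # True
print(lookup(root, "ca"))     # False (a prefix, not a word)
```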
We looked at representation and model building from multiple perspectives. Our focus? To explore computational thinking and see how wide the scope of these two words is. It is now time to put that understanding to work. To understand “understanding”.
Today, we will do that by dissecting a framework for playing with a dataset and picking out what a specific collection of data has to offer. Offer in terms of building a view of the world. Our hope is that in doing this exercise, we will also better understand how we understand something.
How do we, artificial intelligence, models, or machines understand the world? We understand the world by building crude models that approximate behavior. We use historical behavior to test the fit of our model. Here fit is defined as how accurately our model tracks history and behavior. Historical and future behavior are proxies for the reactions we expect to observe if the historical relationships implied by the dataset hold true in the future.
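One crude way to make “fit” concrete: compare what the model says against the history it is supposed to track and compute an error measure. A minimal sketch with made-up numbers, assuming mean absolute percentage error as the measure (any reasonable error metric would do):

```python
# Sketch: measuring how closely a model tracks history.
# Numbers are made up; the fit measure here is mean absolute percentage error.
history = [100, 110, 125, 140]      # observed quarterly values
model   = [ 98, 112, 120, 150]      # what the model said for the same quarters

errors = [abs(m - h) / h for m, h in zip(model, history)]
mape = sum(errors) / len(errors)
print(f"Mean absolute percentage error: {mape:.1%}")  # smaller = better fit
```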
How does model building work?
Model building is a process with multiple goals. One goal is to determine how what we observe in the world works. In one way it is an extension of the representation process. Think of how large language models encode information, or how search engines build a map of the searchable space that hides behind nodes, pages, directories, files and networks on the internet.
For our purposes and within the context of our course, model building has four components. These four components represent attack vectors we use when we want to own a given dataset.
a) Data
b) Assumptions
c) Relationships
d) Questions.
Data
For the purpose of our exercise we look at the State Bank of Pakistan (SBP) Payment Systems dataset (PSD). Data is explicit. It is clear. It is visible. We can see it. We can review it. We can look at it. We can analyze it.
Within the context of midterm prep, our dataset is the quarterly release summary for 31 quarters covering the years 2016 to 2024.
To become one with the data, we have to go beyond the field definitions and a cursory review of individual or collective rows. We have to understand how this specific data is generated. Using Taleb’s words, understand the generator function.
Assumptions
Our second attack vector is assumptions. Sometimes for engagements, we are given assumptions with respect to a dataset. In the case of the State Bank dataset, we have an assumption that bank fees charged on mobile banking transactions are 0.1% of the amount transferred. This is a working assumption. The real fee structure is more complex. This given assumption is an explicit assumption.
A second set of assumptions deals with age and income distribution of mobile banking customers in Pakistan. This is also explicit. An implicit assumption is that the national income and age distributions, which are the basis of the data shared with you, hold true for mobile banking customers in Pakistan. It is implicit because we don’t have direct age or income distribution brackets within the SBP dataset. So, we borrow the national ones. We are assuming that national distributions are applicable or a reasonable approximation to the true distributions.
The four assumptions apply to fee income generation and to the age and income distribution of customers. Using the remaining components of the model building process, we determine how to use data and assumptions together to get to where we need to be.
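In a spreadsheet or a script, writing assumptions down as named parameters keeps them in one place where they can be changed and challenged later. A sketch: only the 0.1% fee comes from the brief above; the distribution figures below are placeholders standing in for the national distributions we borrow.

```python
# Explicit assumptions written down as named parameters.
FEE_RATE = 0.001  # 0.1% of the amount transferred (explicit, given)

# Implicit assumption: national distributions approximate mobile banking customers.
# The shares below are placeholders, not the actual national figures.
AGE_DISTRIBUTION = {"20-24": 0.22, "25-29": 0.20, "30-34": 0.18}                  # partial, illustrative
INCOME_DISTRIBUTION = {"segment_1": 0.15, "segment_2": 0.25, "segment_3": 0.60}   # illustrative

def fee_for_transfer(amount_pkr):
    """Bank fee implied by the working assumption."""
    return amount_pkr * FEE_RATE

print(fee_for_transfer(50_000))  # PKR 50 on a PKR 50,000 transfer
```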
Relationships
The third step of the model building process is relationships. Just like data, some relationships are explicit. Column A + Column B equals Column C. Other relationships need to be understood, derived or extracted from the dataset. This relationship extraction piece determines our ability to understand how the world works and is put together.
For instance, how are mobile banking fees per user calculated? We first estimate the total mobile banking fees for Pakistan’s market per quarter. We then divide them by the number of active users, which gives us the quarterly fee per user. We then multiply this number by 4 to get the annual figure. In Excel this is represented by a formula or an equation. The equation expresses the relationship.
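The same relationship written out as code. The figures are placeholders, not the dataset’s actual numbers; only the structure of the calculation mirrors the Excel formula described above.

```python
# Relationship sketch: annual mobile banking fee per user.
quarterly_transfer_volume_pkr = 4_000_000_000_000   # amount transferred in a quarter (illustrative)
fee_rate = 0.001                                    # 0.1% working assumption
active_users = 16_000_000                           # active mobile banking users (illustrative)

total_quarterly_fees = quarterly_transfer_volume_pkr * fee_rate
fee_per_user_quarterly = total_quarterly_fees / active_users
fee_per_user_annual = fee_per_user_quarterly * 4

print(f"Annual fee per user: PKR {fee_per_user_annual:,.0f}")
```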
In statistical and intelligence terms, when we identify and extract a relationship which was previously either not visible or not understood, we extend our knowledge and awareness of the dataset. We call this process feature extraction. Feature extraction helps us determine what can forecast (describe or predict) future behavior. What are the relevant descriptors that can help us understand future behavior? What are the relevant relationships that would help us identify, build, forecast, and trend the model?
For instance, what determines growth in ecommerce GMV? Is it the number of vendors, the number of customers, the average order value, the USD:PKR exchange rate, the number of mobile banking users, or some combination of all of the above?
Take the USD:PKR exchange rate. When we are measuring and reporting results in USD terms, it is one of the most important drivers of results in our dataset. Why? Because we use it to convert all relevant fields into USD terms. If the USD:PKR exchange rate falls to PKR 250 to 1 USD tomorrow, what would happen to ecommerce sales in USD terms? Would they increase or decrease? In PKR terms they may not change immediately, but in USD terms they would rise, because the PKR would be worth more in USD terms. This is an explicit relationship. The implicit relationship is: do ecommerce sales themselves change when the USD:PKR rate falls to 250? The short answer is also yes. Because most ecommerce items are imported, they become more affordable as the exchange rate shifts favorably.
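The explicit relationship here is just a unit conversion. A sketch with illustrative figures: the same PKR GMV produces a larger USD figure when the rupee strengthens from 280 to the hypothetical 250 per dollar.

```python
# Explicit relationship: converting ecommerce GMV from PKR into USD terms.
gmv_pkr = 150_000_000_000          # quarterly ecommerce GMV in PKR (illustrative)

for usd_pkr_rate in (280, 250):    # PKR per 1 USD: a current-ish rate vs. the hypothetical 250
    gmv_usd = gmv_pkr / usd_pkr_rate
    print(f"At {usd_pkr_rate} PKR/USD: GMV = USD {gmv_usd:,.0f}")

# Same PKR GMV, but the USD figure rises as the rupee strengthens.
```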
Questions
The fourth part is the most important part of the model building process. For it defines how we build the model and what the model will or will not do. It is the list of questions the model will answer. If we don’t know the questions we are supposed to answer, we can’t build the right models.
The four components together determine the framework we use when we are given a dataset.
a) Data
b) Assumptions
c) Relationships
d) Questions
Two hats, four lenses.
For your midterm examination, apply these four pieces together with respect to the SBP PSD dataset. When you do, you need to wear two separate hats.
The first hat is the representation challenge. How do we understand this dataset? What is the context? How do we become one with the distribution? Perhaps generating the distribution may be a good starting point. (This is where you go check out the generating-a-histogram tutorial shared on the Discord server.)
What makes this specific dataset interesting and useful from an assessment perspective is that there are multiple distributions. Not one. They are not highlighted but you can identify them once you start playing with the dataset. How do these distributions interact with each other? Which distribution should we start with?
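If generating the distribution is the starting point, the mechanics are simple: bin the quarterly values and count how many fall in each bin. A minimal sketch with made-up values; the Discord tutorial walks through the same idea in Excel.

```python
# Sketch: turning a column of quarterly values into a histogram (a distribution).
values = [12, 15, 14, 18, 22, 21, 25, 28, 27, 31, 30, 34]  # e.g. transactions per quarter, millions (made up)
bin_width = 5

histogram = {}
for v in values:
    bin_start = (v // bin_width) * bin_width
    label = f"{bin_start}-{bin_start + bin_width - 1}"
    histogram[label] = histogram.get(label, 0) + 1

for label, count in sorted(histogram.items()):
    print(f"{label}: {'#' * count}")
```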
The second hat is the signal compression hat. All data represent a mix of noise and signals. What are those underlying signals for the SBP dataset? What are we trying to discern, determine or detect?
Why is compression important? For a given question there is a reasonable chance that we are not going to use all the information available. We are only going to use some of the information to answer a specific question. Building this map is important so we are not overwhelmed by noise during processing.
For example, take income distribution by age bracket. Income distribution by age brackets looks at the breakdown of a given age bracket (20–24) by income segments (segment 1 to 9).
How is this different from age distribution by income brackets?
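The difference is which way you normalize the same table: within an age bracket (across its income segments) or within an income segment (across its age brackets). A sketch with made-up counts:

```python
# Same table, two different questions, depending on which way we normalize.
# Counts are made up; rows are age brackets, columns are income segments.
counts = {
    "20-24": {"segment_1": 40, "segment_2": 35, "segment_3": 25},
    "25-29": {"segment_1": 20, "segment_2": 45, "segment_3": 35},
}

# Income distribution by age bracket: normalize each row
# (answers: within 20-24, who earns what?)
for age, row in counts.items():
    total = sum(row.values())
    print(age, {seg: round(n / total, 2) for seg, n in row.items()})

# Age distribution by income bracket: normalize each column
# (answers: within segment_1, how old are the users?)
for seg in ["segment_1", "segment_2", "segment_3"]:
    total = sum(row[seg] for row in counts.values())
    print(seg, {age: round(row[seg] / total, 2) for age, row in counts.items()})
```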
Training vs inference
Which one do you use for which question? Data can be cut or sliced different ways to answer different questions. We don’t have time to build these maps in real time, which is why we have two cycles. A training cycle, which we refer to as pre-work in this course: the class work, lectures, assignments and tutorials represent the map building process. When we use these maps to answer questions, we start the assessment cycle, which maps to inference within the world of large language models.
Once we have identified this map, the next question is how do we use that information? Which of the four lenses do we need to answer the question that has been posed or is likely to be posed? Remember the four lenses from the I in BI lecture. (A short sketch applying all four follows the list.)
a) First order. (Presents information as is.)
b) Second order change. (Digs deeper by examining the trend of change.)
c) Distributions. (Collate patterns of change over time.)
d) Trend lines. (Help forecast or prepare an outlook.)
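A compact sketch of the four lenses applied to one made-up quarterly series. The series and the choice of a simple least-squares trend line are ours, purely illustrative.

```python
# Four lenses over one made-up quarterly series.
from collections import Counter

series = [100, 110, 125, 120, 140, 160]

# a) First order: the values as they are.
print("values:", series)

# b) Second order: quarter-over-quarter change (the trend of change).
changes = [b - a for a, b in zip(series, series[1:])]
print("changes:", changes)

# c) Distribution: how often each size of change occurs.
print("distribution of changes:", Counter(changes))

# d) Trend line: a simple least-squares slope for a rough outlook.
n = len(series)
xs = range(n)
x_mean, y_mean = sum(xs) / n, sum(series) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, series)) / sum((x - x_mean) ** 2 for x in xs)
print("next quarter (rough):", round(y_mean + slope * (n - x_mean), 1))
```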
Questions fall in multiple categories. One set of questions is how do things work? These are relationship questions. Do you understand how things work? For instance, how are mobile banking fees generated? What are the components that determine Ecommerce GMV for a given quarter in USD terms?
A different category deals with future forecasts or predictions. What is the likely mobile banking revenue or ecommerce GMV prediction for next quarter? What the model says about the future is a function of history, relationships, and trends. If we ask, what is your forecast? The forecast is a trend question. We build a trend line and use it to make a prediction.
If we ask, what is the probability that this would happen, we are looking at a distribution question. Distribution questions often follow prediction questions. Is this event likely to happen? Yes. What is the probability of that yes? To predict and assign a probability, you need to build both a trend line and a histogram.
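A sketch of the probability half of that pair: treat the historical distribution as the evidence and read the probability off it as a frequency. The growth rates and the 5% threshold below are made up for illustration.

```python
# Sketch: attaching a probability to a "will it happen?" question using the history itself.
growth_rates = [0.04, 0.07, -0.02, 0.05, 0.09, 0.03, -0.01, 0.06]  # made-up quarterly growth

# Question: will next quarter grow by more than 5%?
threshold = 0.05
hits = sum(1 for g in growth_rates if g > threshold)
probability = hits / len(growth_rates)

print(f"P(growth > {threshold:.0%}) is roughly {probability:.0%}, based on {len(growth_rates)} quarters")
```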
If we ask, what if we change this and do this, what would that do to the results, we are asking a relationship question. If we ask what is the likelihood of something happening, a distribution question. What is the outlook for the next quarter, a trend question.
A good model or model builder should be able to answer all three types of questions. When we can do this, we say that we have built a model which is representative of the world. It’s functional enough for us to use it.
Models are not perfect. They are an approximation. We understand that. But functional models, despite being approximations are working models. A good enough model to answer questions that we need to ask. Essentially, this is exactly what large language models attempt to do. Also, what search engines do when they build a map of the world. This is what you must do when you examine your dataset.
Why is this exercise hard? There are two reasons.
One, we have never emphasized these aspects of model building in your question sets or in the classroom, as yet. Other than representation and basic model building, this is new. It is also a context switch. Moving from discussing representation to actually doing a representation exercise consciously. We do representation every day when we read, learn, observe or model something. But we do it without thinking about it. By making you slow down and observe what you are doing, the exercise becomes infinitely harder.
So far in your assessments, you were given sufficient information within limited dimensions that you could use to answer questions. In this specific instance, you’ve been given that information in separate pieces, but the noise within those pieces hasn’t been removed. There are also multiple dimensions to the exercise. You need to learn to filter noise out and collate the signals together to answer the questions you need to answer.
Two, you are not familiar with the process of unstructured model building. When you build a Lego set, you have a step-by-step guide. That step-by-step guide is missing with respect to the SBP dataset. That is by design. The dataset is structured, but the questions, assumptions, relationships have been removed from your shared context.
Sometimes it is important for us to figure out on our own where we should start on a project. This is one of those times. It doesn’t matter whether we build our models in Excel, whether we write a program, whether we do a database query, whether we draw them on paper, or whether we build them with pieces of wood, nails and a hammer. If we don’t know where or how to start, we are not going anywhere.
Take the example of the HBL mobile banking fee estimation exercise that we covered tonight in our tutorial. The average mobile banking fee per user in 2024 was $9 per user per year. Is the HBL average higher or lower than the market average?
How would you answer this question? We answered this question by looking at the SBP PSD dataset to calculate the average fee, the mobile banking application market share to calculate the distribution and the income distribution to calculate the weighted average fees. We knew how to do this because we had the data available and I showed you how. What if it wasn’t? And I didn’t? How would our model design change?
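A sketch of the weighting step in that approach. The segment shares and per-segment fees below are placeholders, not the tutorial’s actual numbers; only the $9 market average comes from the text above.

```python
# Sketch: weighted average mobile banking fee per user for one bank vs. the market.
market_average_fee_usd = 9.0   # average annual fee per user in 2024 (from the dataset, per the tutorial)

# Hypothetical income-segment mix of the bank's users and per-segment annual fees.
segment_share = {"low": 0.40, "mid": 0.40, "high": 0.20}
segment_fee_usd = {"low": 5.0, "mid": 10.0, "high": 20.0}

bank_weighted_fee = sum(segment_share[s] * segment_fee_usd[s] for s in segment_share)
print(f"Bank weighted average fee: ${bank_weighted_fee:.2f} vs market average ${market_average_fee_usd:.2f}")
```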
Why?
Much more than the model itself, what matters most is that you’re trying to build one. Why? Because this is exactly what we do when we write a program. Programs are also process based. Input, process, output. Representation, our ability to understand the world, is also linked to our ability to build models. And by extension our ability to write programs that work, that last, that change the world.
Some models are straightforward, clear and explicit, like our board evaluation Excel spreadsheet for determining the probability of winning for a player. Others are implicit. When you look at them, you try and understand how things work. The SBP dataset is one such model.
Intent matters. What is our intent from this specific exercise?
a) To get you comfortable with the model building and representation processes, so that you understand how hard and how important good representation is.
b) Build on the momentum of group collaboration so that you and your group are ready for the next stage of the course, which focuses on building a 16-bit CPU.
c) To get you comfortable with looking at the world from a systemic view. Where, like Neo in The Matrix, when you see the world, you see all the underlying parts that come together to make it work.
Why the SBP dataset?
Why the State Bank of Pakistan’s payment systems dataset? Payment systems represent the single largest technological intervention in the financial services sector in Pakistan across 30 years. Collectively they changed the profile of this country. Starting off with ATMs, then nationwide switches, then banking middleware, then internet banking, then mobile banking, and now digital banks. It is layers upon layers of technological infrastructure and software. When it comes to the financial services sector (FinTech, InsureTech, RiskTech, or GovTech), we generate an enormous amount of data every day. We also need people with analysis, forecasting and prediction capability. Read: Employment opportunities.
Why Excel? Within your context, in the first semester, Excel is the next best thing to a programming language. It teaches problem solving, sequential thinking, and structured model building and representation skills without you actually having to learn how to program.
Like programming, Excel makes it possible for you to trace the work you’ve done and the impact that work is having. It is like a giant scratchpad that allows you to generate results and to trace how those results have been generated from one step to the next.
There is a third bonus element. If you know how to work with directional analysis (income vs age distribution), forecasting (trends), Bayesian analysis (distributions), and tabulations (Pivot Tables), or think better than your peers, it is that much easier to lock in that summer internship in ’25.
Real world and real work is complex. The sooner we get comfortable with doing complex work, the faster we can take on the real world. Best of luck for your Sunday assessment.
The computational thinking tutorial lecture on representation, understanding and analysis.