Prototypes

The world of big data analytics, machine learning and AI is exploding at a pace that leaves many businesses unprepared.

The advantage of byte-sized learning is the opportunity to test new solutions to existing business use cases in a fail-fast prototyping environment.

Nobody has time to reinvent every modern analytics tool from scratch. But grafting an off-the-shelf tool onto an established tech stack can often uncover surprises that the vendor's proofs of concept could not be expected to cater for.

Prototypes strike a pragmatic balance and arm procurement leads with the right questions to ask early in the decision process. This is a powerful way to distinguish clearly between what we are buying and what is left to develop.

Pinecone RAG for an AI Sales Assistant

RAG

RAG, or Retrieval-Augmented Generation, is a technique that combines retrieval-based search with a generative AI model to improve the quality and relevance of machine-generated text. It is a useful way to give an AI assistant access to data that is relevant to a search query, in this case a situation where a customer might be looking for a music album or a book to buy.

Vector Embeddings

Knowing what is relevant from a machine's perspective means finding points in the database that are closer to the input text by some definition. To perform this search, text information is first translated and expressed as points in a high dimensional vector space, in a process known as vector embedding. The sentence "I want a soft acoustic rock album" could be translated into a list, called a vector, of more than 700 numbers. The translation method is set up so that texts with similar meanings map to vectors that are close together in a mathematical sense.
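As a minimal sketch of what "close together" can mean, the snippet below uses cosine similarity, one common closeness measure. The three-number vectors are made up for illustration (a real embedding has hundreds of values), and the source does not say which measure this prototype uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not the output of a real embedding model.
query = [0.9, 0.1, 0.0]            # "I want a soft acoustic rock album"
acoustic_album = [0.8, 0.2, 0.1]   # similar meaning, similar direction
techno_album = [0.0, 0.1, 0.9]     # different meaning, different direction

print(cosine_similarity(query, acoustic_album) > cosine_similarity(query, techno_album))
```

The comparison prints True because the acoustic album's vector points in nearly the same direction as the query's.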

Pinecone

Pinecone is a well-known vector database which, in this example, is used to store information about albums as vectors. It returns a list of the most relevant albums, say the top 5, based on what the user asked for.
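The top-5 lookup can be pictured with a small in-memory stand-in. This sketch does not use the Pinecone client at all: the index dictionary, the album identifiers and the two-number vectors are all hypothetical, and it scans every vector, which a vector database would normally avoid by using an index.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, index, k=5):
    """Return the k (score, album_id) pairs most similar to the query, best first."""
    scored = ((cosine(query_vector, vec), album_id) for album_id, vec in index.items())
    return heapq.nlargest(k, scored)

# Hypothetical album vectors keyed by a made-up identifier.
index = {
    "sea-shanties": [0.9, 0.1],
    "ocean-ambient": [0.8, 0.3],
    "desert-blues": [0.1, 0.9],
}

results = top_k([1.0, 0.0], index, k=2)
print([album_id for _, album_id in results])
```

With these toy values the two sea-themed entries come back first and the third is left out.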

OpenAI

OpenAI provides an API to its ChatGPT models. In this example, the API receives the search results from Pinecone and explains in plain language why the recommended album matches the user's query.
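The "augmented" part of RAG amounts to placing the retrieved records in the prompt. The sketch below only builds the prompt text and makes no API call. The function name, the wording of the instructions and the album records are all hypothetical, not taken from the prototype.

```python
def build_prompt(user_query, retrieved_albums):
    """Combine the user's request with retrieved album records into one prompt string."""
    lines = [
        f"- {album['title']} by {album['artist']}: {album['description']}"
        for album in retrieved_albums
    ]
    return (
        "You are a sales assistant in a music shop.\n"
        f"The customer asked: {user_query}\n"
        "Candidate albums from the catalogue:\n"
        + "\n".join(lines)
        + "\nRecommend one album and explain briefly why it matches the request."
    )

# Made-up records standing in for the vector search results.
albums = [
    {"title": "Tidal Songs", "artist": "Example Artist", "description": "folk songs about sailing"},
    {"title": "Deep Water", "artist": "Another Artist", "description": "ambient ocean recordings"},
]

prompt = build_prompt("I want an album about the sea", albums)
print(prompt)
```

The resulting string would then be sent to the language model, which answers using only the candidates it was given.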

For example, try entering "I want an album about the sea", click Submit, and see if you like the recommendation!

AI Sales Assistant

GraphESG

Graph Databases for ESG

Basics

Graph databases, such as AuraDB and AuraDS from Neo4j, offer a unique ability to model, store, and query relationships in data inherently and efficiently. This relationship-based structure makes them well suited to querying the complex inter-relationships that often exist when understanding the effective investment holding of one company in another, especially if the investment is through an intermediary company.

In a graph database, the entities are known as nodes and the relationships between them are called edges.

For example, if Company A owns x% of the shares in Company B, which in turn owns y% of the shares in Company C, then Companies A and C have an indirect investment relationship. Calculating the effective investment of A in C, in a process known as look-through, can be challenging, especially if there might be multiple chains of investment between A and C. Each of the chains could have a varying number of steps, complicating efforts to define the calculations in terms of standard database operations that typically start from the assumption that the number of rows is fixed.
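The look-through arithmetic can be sketched in a few lines: multiply the shareholdings along each chain, then add the chains together. This is an illustration on a plain Python dictionary rather than the Neo4j query behind the model. The function name, the max_steps parameter and the decision to skip circular ownership are assumptions made for the example.

```python
def effective_holding(holdings, investor, investee, max_steps=4):
    """Sum, over every chain from investor to investee of at most max_steps
    edges, the product of the fractional shareholdings along that chain."""
    def walk(node, steps_left, visited):
        total = 0.0
        for target, fraction in holdings.get(node, {}).items():
            if target in visited:
                continue  # assumption: ignore circular ownership
            if target == investee:
                total += fraction  # this chain ends here
            elif steps_left > 1:
                total += fraction * walk(target, steps_left - 1, visited | {target})
        return total
    return walk(investor, max_steps, {investor})

# A owns 50% of B and 10% of C directly; B owns 40% of C.
holdings = {"A": {"B": 0.5, "C": 0.1}, "B": {"C": 0.4}}

print(effective_holding(holdings, "A", "C"))               # about 0.30: 0.1 direct plus 0.5 * 0.4
print(effective_holding(holdings, "A", "C", max_steps=1))  # 0.1: the direct holding only
```

Because each chain may have a different length, the recursion follows the chains one edge at a time instead of joining a fixed number of tables.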

Investment managers looking to understand a fund's compliance with ESG (Environmental, Social, and Governance) principles have a more specific challenge. Rules exist to determine whether a particular company's operations qualify as ESG compliant. But multiple layers of investments in many companies, where some qualify as compliant and others do not, make it no simple task to calculate ESG look-through, the proportion of assets in the ultimate parent fund that counts as ESG compliant, especially when the market values of the underlying investments vary relative to one another.
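A simplified sketch of ESG look-through is a market-value-weighted average that recurses through each layer of holdings. It assumes, purely for illustration, that a company with no holdings is either fully compliant or not compliant at all and that ownership contains no loops. The company names, values and function name are hypothetical.

```python
def esg_share(node, portfolio, compliant):
    """Proportion of a node's value that counts as ESG compliant.

    portfolio maps an investor to {investee: market value held};
    compliant maps a company with no holdings to True or False.
    """
    holdings = portfolio.get(node)
    if not holdings:
        return 1.0 if compliant[node] else 0.0
    total_value = sum(holdings.values())
    weighted = sum(
        value * esg_share(investee, portfolio, compliant)
        for investee, value in holdings.items()
    )
    return weighted / total_value

compliant = {"SolarCo": True, "WindCo": True, "CoalCo": False}
portfolio = {
    "Fund": {"SolarCo": 60.0, "SubFund": 40.0},
    "SubFund": {"WindCo": 25.0, "CoalCo": 75.0},
}
print(esg_share("Fund", portfolio, compliant))  # (60 * 1 + 40 * 0.25) / 100

# If the SolarCo holding halves in value, the compliant share changes too.
portfolio["Fund"]["SolarCo"] = 30.0
print(esg_share("Fund", portfolio, compliant))  # (30 * 1 + 40 * 0.25) / 70
```

The second call shows why relative market values matter: nothing about any company's compliance changed, yet the fund's compliant proportion moved.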

Graph Database using Neo4j and Bokeh

The model embedded below is based on a Neo4j AuraDB graph database and shows an implementation using 60 companies with 1,107 investment relationships between them. To see the effective relationships as a graph, select any two companies from the list and click Submit. By default, this returns relationships of up to two steps, even if there could be several more. Increase to 3 or 4 steps to see the additional relationships. Of the two selected companies, the circle at the top left with a black edge is the investor, the one that has invested in the other, if such a relationship exists. At the bottom left is the investee of the pair, also circled in black. If no relationship exists, a placeholder graph will tell you so.

Simplifying what happens in reality, companies in the example are separated into:

Operating companies, which may invest in other companies and also have operations of their own, marked in green and
Asset managers, which only invest and have no significant business operations of their own, marked in red

The graph shows all the investment relationships (graph database edges) as grey lines. A useful feature of the visualisation program, known as Bokeh, is the ability to add hover-tools which display information when the user hovers the mouse pointer over specific areas:

Hover over a company to see its name, type and market value
Hover over a relationship to see the key metrics including investor, investee and shareholding

Finally, the subheading of the chart shows the proportion of the effective investment that counts as ESG compliant.

To see an interesting relationship, try Arch Yield Corporation, Vortex Prime Communications and 3 steps.

One drawback of this implementation is that the runtimes get very long when exploring paths that involve more than 4 steps. Understandably, the number of possible investment paths, and the number of computations required, increase very significantly if there is a need to consider 5, 6 or more steps. Despite the intention to be computationally efficient, Neo4j struggles with these, and this is on an example database that only has 60 nodes.
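A back-of-the-envelope estimate shows how quickly the number of chains grows. The sketch assumes the 1,107 relationships are spread uniformly at random across the 60 companies and counts walks rather than strictly loop-free paths, so it indicates scale only and says nothing precise about the actual database.

```python
nodes = 60
edges = 1107
average_out_degree = edges / nodes  # about 18.45 investments per company

def expected_walks(steps):
    """Rough expected number of chains of exactly `steps` edges between one
    chosen pair of companies, under the uniform-random assumption."""
    return average_out_degree ** steps / nodes

for steps in range(2, 7):
    print(steps, round(expected_walks(steps)))
```

Under these assumptions each extra step multiplies the number of candidate chains by roughly 18, which is consistent with runtimes climbing steeply beyond 4 steps.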

In reality there is, of course, no such limit on the number of investment steps. The long runtimes are why the available steps in the model are capped at 4, and also why this model has Prototype status rather than Production.

Note: Sometimes the database takes a while to start up. If there is no immediate response, wait a few seconds, refresh this page and try again.

ESG Graph

Tax Helper

A significant opportunity in Large Language Models (LLMs) like ChatGPT from OpenAI and Llama 2 from Meta is how easy it is to prime them with a specific context. Below is a relatively simple example to show the difference between two ChatGPT models. The first model is a generic ChatGPT and the second one is trained on the Self Assessment Manual from HMRC that is available at https://www.gov.uk/hmrc-internal-manuals/self-assessment-manual

When asked tax-specific questions, the two models give different answers, and the difference reflects the second model's understanding of the additional context.

For example, in the HMRC manual, a D ref is defined as:

• The Office number, which consists of up to 3 numerals, and
• The Register number allocated to the taxpayer, which will be between 2 and 6 numerals and may be followed by one or two letters
For example 56/51051A.
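The definition above is precise enough to express as a pattern. The sketch below is one reading of it: the "/" separator is inferred from the example rather than stated in the definition, and accepting lower-case letters is an assumption. The names D_REF and is_d_ref are made up for the illustration.

```python
import re

# Office number: 1 to 3 digits. Register number: 2 to 6 digits, optionally
# followed by one or two letters. The "/" separator is inferred from 56/51051A.
D_REF = re.compile(r"[0-9]{1,3}/[0-9]{2,6}[A-Za-z]{0,2}")

def is_d_ref(text):
    """Return True if the whole string has the shape of a D ref."""
    return D_REF.fullmatch(text) is not None

print(is_d_ref("56/51051A"))    # True: the example from the manual
print(is_d_ref("1234/51051A"))  # False: office number has more than 3 digits
```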

Ask both models below "What is a D ref?" and see the difference.

This subject matter awareness can be expanded to include very large volumes of text to help practitioners reach the answers they need more quickly.

To see more examples, ask "What is FINEST?" or "What is IDMS?"

Generic LLM

Tax Specific LLM

Generative BI

Have a conversation with a database

The ability to ask data-related questions through text prompts opens up many opportunities, especially for people without a deep programming or analytical background. Generative BI (Business Intelligence) is a powerful extension to more standard LLM applications.

For example, the Catalogue of Life is a fascinating super-resource of all living and extinct organisms. The chatbot below uses LlamaIndex to execute structured queries on a subset of this dataset, making it possible to answer simple questions such as the ones below:

"What is the common name for Panthera Pardus?"
"What is the scientific name for Cheetah?"
"How many unique animals have "Capensis" in the scientific name?"

The process actually uses two Large Language Models (LLMs), one to structure a Python query on a pandas dataframe, and the other to synthesize the query result into a text response.
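To make the two steps concrete, the sketch below shows the kind of pandas expression the first model might produce for each sample question, with a plain f-string standing in for the second model. The dataframe is a tiny hand-made stand-in, the column names are hypothetical, and no LLM or LlamaIndex call is involved.

```python
import pandas as pd

# Hand-made stand-in for the dataset subset; column names are hypothetical.
df = pd.DataFrame({
    "scientific_name": ["Panthera pardus", "Acinonyx jubatus", "Mellivora capensis", "Lepus capensis"],
    "common_name": ["Leopard", "Cheetah", "Honey badger", "Cape hare"],
})

# Step 1: expressions of the sort the first LLM would generate from a question.
leopard = df.loc[df["scientific_name"].str.lower() == "panthera pardus", "common_name"].iloc[0]
cheetah = df.loc[df["common_name"] == "Cheetah", "scientific_name"].iloc[0]
capensis_count = df.loc[
    df["scientific_name"].str.contains("capensis", case=False), "scientific_name"
].nunique()

# Step 2: the second LLM would phrase the raw result; an f-string stands in here.
print(f"The common name for Panthera pardus is {leopard}.")
print(f"The scientific name for Cheetah is {cheetah}.")
print(f"{capensis_count} animals in this sample have 'capensis' in the scientific name.")
```

Running generated expressions directly against the data is also what makes it possible to coax raw rows out of the chatbot.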

One of the reasons this project is not production-ready is that it requires a very specific version of LlamaIndex. There are also some issues with hardcoded variables that are expected to be resolved as the code base for LlamaIndex matures.

With a few creative questions, the chatbot can be prompted to reveal parts of the underlying data directly. This is not a problem in this instance because the input data is in the public domain. Production applications on privileged information would need to be structured very differently, hence the Prototype designation.

Bank Manager Roulette

Binary classification for credit scoring

Welcome to Bank Manager Roulette! It is a simple simulation game that demonstrates some of the challenges and opportunities associated with consumer lending.

Try your hand on the inputs below, and if you're lucky your name will be added to the leader board!

Consumer lending aims to lend to customers at an interest rate that reflects the risk associated with their personal circumstances. The risk is often expressed as a score, known as a credit score. For example, Experian assigns a score of 300 to the riskiest profile and 850 to the best possible profile. Lenders have two basic choices to make:

What minimum applicant score to accept, and
What interest rate to charge each applicant.

Of course, lending to lower scoring applicants increases the risk of default, the situation where the customer cannot or will not repay the loan and the lender loses all or part of the repayments and interest they were expecting. If the lender can increase the interest rate, then with all other things being equal they will earn more in interest. But a higher interest rate increases the risk of default and also risks pricing the lender out of the market, especially if the applicant can borrow more cheaply elsewhere.

Consumer lending typically focuses on predicting three key risk components:

Probability of Default (PD): The likelihood, expressed as a percentage, that a borrower will default on their loan obligations within a specified period.
Exposure at Default (EAD): The estimated total value that a lender is exposed to at the time of a borrower’s default. This figure typically reflects the outstanding balance, including principal and accrued interest, at the time of default.
Loss Given Default (LGD): The percentage of the EAD that represents the actual financial loss the lender incurs if the borrower defaults, after accounting for recoveries such as collateral or repayments.
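The three components are commonly multiplied together to give an expected loss. A minimal sketch with made-up figures:

```python
def expected_loss(prob_default, exposure_at_default, loss_given_default):
    """Expected loss = PD x EAD x LGD, with PD and LGD given as fractions."""
    return prob_default * exposure_at_default * loss_given_default

# Hypothetical loan: 5% chance of default, 10,000 outstanding at default,
# 40% of that exposure lost after recoveries.
loss = expected_loss(0.05, 10_000, 0.40)
print(round(loss, 2))  # 200.0
```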

The game reflects this reality but makes a few simplifying assumptions:

It focuses only on predicting Probability of Default (PD) and ignores EAD and LGD.
If a customer defaults, it assumes that the entire loan balance is lost and no interest or capital is recovered.
It ignores the Time Value of Money (TVM) and makes no allowance for the potential value of interest that could be earned on repayments the lender receives over the term of the loan.
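Under these simplifications the expected result of a single loan has a simple form. The sketch assumes interest is a single fraction of the principal, earned only when the borrower repays in full. The function names are hypothetical and the game's own calculation may differ.

```python
def expected_result(principal, interest_fraction, prob_default):
    """Expected profit on one loan: interest if repaid, whole principal lost on default."""
    return (1 - prob_default) * principal * interest_fraction - prob_default * principal

def break_even_interest(prob_default):
    """Interest fraction at which the expected result is exactly zero."""
    return prob_default / (1 - prob_default)

# A borrower with a 10% probability of default needs to pay about 11.1%
# interest before the lender expects to break even.
print(round(break_even_interest(0.10), 4))        # 0.1111
print(expected_result(1_000, 0.05, 0.10) < 0)     # True: 5% interest is not enough
```

This is why a lower minimum score only pays off if the interest charged rises faster than the probability of default.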

The game is based on Microsoft's LightGBM Classifier, a model that predicts whether an applicant will default or not based on a history of similar customers. A useful feature of a classifier is that it also provides an estimated probability of default which forms the basis of the simulation. Your challenge as player is to make the most profit from lending to around 180,000 potential borrowing applicants.

To play, enter three values:

Minimum Credit Score. In the simulation you will only be lending to applicants with this score or higher.
Extra Basis Points Interest. This increases the interest rate you will charge applicants. The standard interest rate in each application is between around 5% and 25%. Adding 1 extra basis point increases every interest rate by 1 basis point, which is 1% of 1%. For example, 1 basis point increases 5% to 5.01% and 25% to 25.01%. Adding 100 basis points increases them by a full percentage point, to 6% and 26% respectively.
Your name, in case you get a high score for the leader board.
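The basis point arithmetic can be checked with a one-line conversion. The function name is made up for the illustration.

```python
def add_basis_points(rate_percent, extra_bps):
    """Return the rate in percent after adding extra_bps basis points.

    One basis point is 0.01 of a percentage point.
    """
    return rate_percent + extra_bps / 100

print(round(add_basis_points(5, 1), 2))     # 5.01
print(round(add_basis_points(25, 100), 2))  # 26.0
```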

The final point to note is that there is a catch. Just like in the real world, if you charge too many basis points extra, you will lose business to other lenders.

Click Submit to have a look at your results:

applications_withdrawn are customers who decided to apply elsewhere because you increased their interest rate too much
applications_granted is the number of customers who chose to apply, despite any increase you applied, and who met your minimum credit score
principal_started_millions is the total they borrowed, in millions of dollars
principal_defaulted_millions is the amount you lost to defaults
interest_earned_millions is the interest you earned on applicants who repaid in full
portfolio_result_millions is interest earned minus principal defaulted, in millions
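The relationship between these outputs can be sketched on three made-up applicants. The applicant fields, the max_extra_bps withdrawal rule and the single-period interest calculation are all assumptions for the illustration. The game itself works from model probabilities on around 180,000 applicants and may calculate things differently.

```python
def summarise(applicants, min_score, extra_bps):
    """Compute the result fields for a toy portfolio."""
    # Hypothetical rule: an applicant withdraws if the increase exceeds their tolerance.
    stayed = [a for a in applicants if extra_bps <= a["max_extra_bps"]]
    granted = [a for a in stayed if a["score"] >= min_score]
    defaulted = sum(a["principal"] for a in granted if a["defaulted"])
    interest = sum(
        a["principal"] * (a["rate_percent"] + extra_bps / 100) / 100
        for a in granted if not a["defaulted"]
    )
    return {
        "applications_withdrawn": len(applicants) - len(stayed),
        "applications_granted": len(granted),
        "principal_started_millions": sum(a["principal"] for a in granted) / 1e6,
        "principal_defaulted_millions": defaulted / 1e6,
        "interest_earned_millions": interest / 1e6,
        "portfolio_result_millions": (interest - defaulted) / 1e6,
    }

applicants = [
    {"score": 720, "principal": 200_000, "rate_percent": 6.0, "max_extra_bps": 150, "defaulted": False},
    {"score": 580, "principal": 100_000, "rate_percent": 18.0, "max_extra_bps": 300, "defaulted": True},
    {"score": 650, "principal": 150_000, "rate_percent": 10.0, "max_extra_bps": 50, "defaulted": False},
]

cautious = summarise(applicants, min_score=600, extra_bps=100)
lenient = summarise(applicants, min_score=550, extra_bps=0)
print(cautious)
print(lenient)
```

In this toy data the cautious settings lose one customer to a competitor but avoid the default, while the lenient settings lend to everyone and end with a loss.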

If you cannot see the leader board, please scroll down.

Good Luck!