Big Data for Retail Banking
Big data in retail banking covers areas such as:
- Individualization of product offers to existing clients.
- Early fraud detection and fraud damage mitigation.
- Prediction of product cancellations and client defections.
- Optimal allocation of cash to ATMs and bank branches.
- Minimization of usage of expensive bank channels such as branch visits.
- Reliable assessment of clients for debt products.
Datasets from backups and relational databases are replicated into Hadoop. Machine learning technologies are applied to find hidden patterns and correlations in data.
A common dataset of monthly expense and income categories is built for all clients. This dataset is created from bank account movements, direct debits and standing orders. Each account movement is usually accompanied by a movement code, for example for electricity, a phone bill or a restaurant type. The merchant's name, description and comment fields are also used to categorize transactions.
We recognize several categories of expenses such as housing expenses (rent or mortgage), energy expenses (gas and electricity), food and household related expenses, education (schools, books, courses), car expenses (fuel and repairs), restaurants, big ticket items (TV, furniture), taxes, recreation and hobby, credit card and loan payments, luxury items and so on.
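The categorization step can be sketched in Python as follows. This is a minimal illustration only: the movement codes, keyword lists and field names (`code`, `merchant`, `description`, `comment`) are hypothetical placeholders, not the bank's real code tables.

```python
import re

# Hypothetical movement codes; a real bank would load its own code table.
CODE_TO_CATEGORY = {
    "EL01": "energy",
    "PH01": "phone",
    "RS01": "restaurants",
}

# Hypothetical keywords looked up in the free-text fields.
KEYWORD_TO_CATEGORY = {
    "rent": "housing",
    "mortgage": "housing",
    "fuel": "car",
    "school": "education",
}


def categorize(movement):
    """Return an expense category for one account movement (a dict)."""
    # 1. Prefer the movement code when it is present and known.
    category = CODE_TO_CATEGORY.get(movement.get("code"))
    if category:
        return category
    # 2. Fall back to whole-word keywords in merchant, description and comment.
    text = " ".join(
        str(movement.get(field) or "")
        for field in ("merchant", "description", "comment")
    ).lower()
    words = set(re.findall(r"[a-z]+", text))
    for keyword, keyword_category in KEYWORD_TO_CATEGORY.items():
        if keyword in words:
            return keyword_category
    return "uncategorized"
```

Matching on whole words rather than substrings avoids false hits such as "rent" inside "current". Movements that match nothing are kept as "uncategorized" rather than guessed.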
Income categories are salaries, dividends, tax refunds, social benefits, rental income, sales and so on. Simple regression analysis of this dataset gives us overall trends for total expenses, incomes and savings, as well as detailed trends for each category of income and expense for each client.
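As a rough illustration of such a trend calculation, the sketch below fits an ordinary least-squares line to a monthly series and reports its slope per client and category. The nested-dictionary layout of `monthly` is an assumption made for this example.

```python
def linear_trend(values):
    """Least-squares slope and intercept of values against month index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two monthly values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def category_trends(monthly):
    """monthly: {client_id: {category: [amount per month, oldest first]}}.

    Returns the monthly change (slope) for every client and category.
    """
    return {
        client: {cat: linear_trend(series)[0] for cat, series in cats.items()}
        for client, cats in monthly.items()
    }
```

A positive slope means the category is growing month over month, and a negative slope means it is shrinking.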
Machine Learning and Predictions
We can use the full range of machine learning algorithms and models to make predictions. There are two broad categories of algorithms: supervised and unsupervised.
Supervised learning algorithms use historical data to learn that certain combinations of input values lead to certain output values. Our models are trained and verified on samples of historical data. Sample data can be chosen randomly, but we have seen better results when datasets are categorized first. The customer dataset has categories such as age, income, location based on town size, education and savings. Each category is split into brackets. For example, the age category is split into 20 five-year brackets. We can see the number of customers in each age bracket, so we can sample 5% of the records from each bracket. These samples are well suited to showing which categories make the largest contribution to overall results. For example, we can see that education makes the largest contribution to the acceptance of certain investment products.
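The per-bracket sampling described above can be sketched as follows. The record layout (a dict with an `age` key), the fixed random seed and the rule of keeping at least one record per bracket are assumptions made for this example.

```python
import random
from collections import defaultdict


def age_bracket(age, width=5):
    """Map an age to the lower bound of its five-year bracket, e.g. 37 -> 35."""
    return (age // width) * width


def stratified_sample(customers, fraction=0.05, seed=0):
    """Sample the same fraction of records from every age bracket."""
    rng = random.Random(seed)
    by_bracket = defaultdict(list)
    for customer in customers:
        by_bracket[age_bracket(customer["age"])].append(customer)
    sample = []
    for bracket in sorted(by_bracket):
        members = by_bracket[bracket]
        # Assumption: keep at least one record so small brackets stay represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample
```

Compared with a purely random 5% sample, this keeps the age distribution of the sample aligned with that of the full customer dataset.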
Unsupervised machine learning algorithms look for unknown patterns in available data.
Patterns of unusual client behaviour can be used to find early signs of fraud.
Individualization of Product Offers
Banks can save money on broad and expensive marketing campaigns to promote bank products. Products will be offered only to customers who need them and are likely to accept them. Customers should see fewer irrelevant offers. This requires deep knowledge of who accepted given products in the past.
Datasets of subscriptions to bank products and services, as well as their historical values, are analyzed. A separate model is created for each product and subscription. We choose and verify the best learning algorithm and find which categories and variables have the biggest influence.
Early Fraud Detection and Fraud Damage Mitigation
This area includes detection of identity fraud, credit card fraud, wire fraud, attacks on internet and mobile banking, and money laundering. New types of fraud and new schemes require flexible and fast detection algorithms. In the past, banks used only statistical and rule-based algorithms to find whether suspicious activity was taking place. These algorithms were limited because they could only recognize known frauds, required expensive maintenance, did not work with the full history of each client and had a high level of false positives.
We have utilized a dataset of known fraud cases. Fraud cases were sorted into several categories such as overdraft fraud with stolen identity, stolen credit cards, consumer loan fraud, credit card top-up with a fraudulent check, stolen checks, skimming with card duplication, attacks on online banking with stolen customer credentials and/or security devices, rogue online merchant fraud using credit cards and so on. Neural networks with backpropagation were used as well as decision tree algorithms. These algorithms were applied to existing datasets to find unknown occurrences of fraud.
Prediction of Product Cancellations and Client Defections
Prediction of bank product cancellations and client defections is very time sensitive. The bank has just days to act before a client irreversibly decides to cancel a product or move to the competition. The bank needs to identify clients who are likely to defect, contact them and proactively offer alternative products or solve their issues. It is much cheaper to retain highly profitable clients than to attract them back.
We have utilized datasets of account movements, debit and credit card movements, the client dataset from CRM, the product subscription dataset, call centre and branch visit transactions, and log information as primary data sources for predictions. We also utilized the common dataset of incomes and expenses.
We have created time series of key events such as direct debit cancellations, incomes from salaries, dividends and rents, transfers to the client's accounts at other banks, call centre and branch contacts made by the client, cancellations of credit cards and so on.
For comparison, we have selected another set of clients who match the departed clients in categories such as age, income, savings and location for the same time interval but who still remain clients.
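One possible way to pick such a comparison set is exact matching on coarse brackets, sketched below. The bracket widths, the field names (`age`, `income`, `savings`, `location`) and the one-to-one matching rule are assumptions made for this example, not a description of the matching actually used.

```python
import random
from collections import defaultdict


def profile_key(client):
    """Coarse matching key: age, income and savings brackets plus location type."""
    return (
        client["age"] // 5,           # five-year age bracket
        client["income"] // 10_000,   # hypothetical income bracket width
        client["savings"] // 10_000,  # hypothetical savings bracket width
        client["location"],
    )


def matched_controls(defectors, active_clients, seed=0):
    """Pick one still-active client with the same profile key for each defector."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for client in active_clients:
        pool[profile_key(client)].append(client)
    controls = []
    for defector in defectors:
        candidates = pool.get(profile_key(defector), [])
        if candidates:
            # Remove the chosen client so nobody serves as a control twice.
            controls.append(candidates.pop(rng.randrange(len(candidates))))
    return controls
```

Defectors with no matching active client are simply left without a control in this sketch.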
Based on these inputs we have created models that are able to predict the behaviour of clients before they irreversibly decide to move to competitors. We have used several supervised learning algorithms, such as Support Vector Machines for binary classification and a neural network with backpropagation for predictions. Among unsupervised machine learning algorithms, we utilized K-Means and Mean Shift clustering after Principal Component Analysis was applied to reduce the dimensions of the input data.
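To give a feel for the clustering step, here is a minimal K-Means sketch (Lloyd's algorithm) in plain Python. It is only an illustration: the PCA step is omitted, the input points stand in for already reduced client features, and taking the first k points as initial centroids is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points are the initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its member points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignment
```

Clients who land in the same cluster as known defectors would then be candidates for closer review. A production setup would more likely rely on a tested library implementation with better initialization.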
We have identified several hundred profitable clients in recent data who matched the patterns of clients who moved their accounts to competitors. These clients should be contacted by their respective bank branches.
Optimal Allocation of Cash for ATMs and Bank Branches
Demand for cash is highly variable during the year at many ATMs and bank branch locations. This variability is caused by weather, local events, vacations, tourism and so on. It is important to predict the right amount of cash that needs to be deposited into ATMs as well as bank branches. It is costly to service ATMs too often, and it is also costly to have cash machines out of order due to a lack of cash. At the same time we wanted to limit the amount of unnecessary cash that is stored for long periods in ATMs and bank branches. Excess cash is a suboptimal allocation, and it also attracts crime.
We have used ATM service logs, geographic locations of ATMs and bank branches, the withdrawal dataset for each ATM, weather reports for ATM and bank branch locations, and schedules of sports, cultural or other events as well as holidays for all locations. Other data sources included credit and debit card movements, used to assess demand for cash at various locations and during different times of the year. We have utilized the common dataset of incomes to see when salaries, social benefits and other incomes arrived in clients' accounts at different locations.
We have created a dataset of median cash withdrawal amounts for each day of the year and hour of the day for all ATMs. This dataset is used to calculate the influence of weather, events, day of the week or holidays on demand for cash at a given location.
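Building such a baseline can be sketched as below. The input layout, `(atm_id, datetime, amount)` tuples, is an assumption, and so is the choice to take the median over individual withdrawals pooled across years; the text does not say whether hourly totals were used instead.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_withdrawals(withdrawals):
    """withdrawals: iterable of (atm_id, datetime, amount) tuples.

    Returns {(atm_id, day_of_year, hour): median withdrawal amount}.
    """
    buckets = defaultdict(list)
    for atm_id, when, amount in withdrawals:
        key = (atm_id, when.timetuple().tm_yday, when.hour)
        buckets[key].append(amount)
    return {key: median(amounts) for key, amounts in buckets.items()}
```

Note that the day-of-year index shifts by one after February in leap years, which a real pipeline would need to handle.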
We have used a dataset of significant cultural, sport and other events during the past 4 years with location coordinates. From it we calculate the influence of each event on cash demand for all ATMs within a 100 m radius of the event, which lets us rank all events by their influence on cash demand. This dataset is used to predict the influence of similar events.
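Selecting the ATMs within the 100 m radius of an event can be sketched with the haversine great-circle distance. The dictionary keys (`id`, `lat`, `lon`) are hypothetical, and haversine is one reasonable choice of distance formula rather than a confirmed detail of the original system.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def atms_near_event(event, atms, radius_m=100):
    """Return the ids of ATMs within radius_m metres of the event location."""
    return [
        atm["id"]
        for atm in atms
        if haversine_m(event["lat"], event["lon"], atm["lat"], atm["lon"]) <= radius_m
    ]
```

For each event, the cash demand of the returned ATMs on the event day can then be compared with their usual demand for that day.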
We have also calculated the correlation between cash demand and local weather parameters such as precipitation, temperature and wind at the location of each ATM.
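A correlation of this kind can be computed with the Pearson coefficient, sketched below in plain Python. The text does not state which correlation measure was used, so Pearson is an assumption here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Called with daily precipitation and daily withdrawals for one ATM, a value near -1 would suggest that rainy days go with lower cash demand at that location, while a value near 0 would suggest no linear relationship.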
We have created a dataset of correlations between the days when clients receive incomes, such as salaries and social benefits, and cash demand at different locations.
We have created models that can predict cash demand for each day of the year for each ATM and bank branch location. These models take into account historical weather forecast data and schedules of events. They utilize algorithms such as Restricted Boltzmann Machine, Perceptron and Gaussian Discriminant Analysis.
Minimization of Usage of Expensive Channels
We have contributed to minimizing the use of expensive bank channels such as over-the-counter operations and other bank branch visits as well as calls to call centres.
This can be achieved by optimizing online banking and mobile banking applications, help pages and wizards, as well as web pages on the bank's websites. Another way to encourage reluctant clients to switch to cheaper channels is through targeted campaigns.
The primary sources of data for analysis are web log files from the online banking application as well as mobile banking applications. We have used bank account movements with codes of bank channels, a dataset of call centre transactions, the CRM dataset with information about customers and a dataset of transactions from bank branches.
Another important dataset consists of complaints and enquiries from the call centre, emails, letters and branches. We have sorted this dataset by areas of interest and correlated it with help web pages. The analysis identifies help pages that are unclear and cause confusion and unnecessary calls to the call centre. It also identifies operations in online banking that are complex and generate a higher number of complaints. It uncovered several areas related to exchange rates for credit card payments that were not covered by help pages but were often discussed over the phone or even during bank branch visits. Changes made to product-related web pages, self-help pages, search optimization, online banking operations and mobile banking applications can bring quick savings on outsourced call centres and bank branch visits.
We have analyzed the results of marketing campaigns aimed at moving reluctant clients to online and mobile banking or self-service kiosks. Correlation analysis uncovered that some broad marketing campaigns were not efficient. We have analyzed patterns of bank clients who recently moved most of their operations online. This gave us a tool to select the portion of clients who are more likely to move online. These customers should be targeted by personalized marketing campaigns or by demonstrations of the advantages at bank branches.
Assessment of Clients for Debt Products
In order to reliably assess risks and approve debt products for existing clients, we need to take into account not just the current credit scores and current disposable income of the clients but also the complete history of each client as well as their social context. This decreases risk for the bank and also increases income from valuable clients who would otherwise be rejected.
We have used the common dataset of incomes and expenses, the complete history of payment morale for credit cards, consumer loans, mortgages, overdrafts and other debt products, and CRM information about clients.
The assessment uses a Markov chain stochastic process to model the debt-related behaviour and payment morale of clients. This model was tested on historical data of profitable and defaulted loans, credit cards and other debt products. We have noticed improved reliability of credit scores, and we were able to suggest suitable alternative debt products for rejected clients.
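As a minimal illustration of the Markov chain idea, the sketch below estimates transition probabilities between payment states from monthly state histories. The state names (`on_time`, `late`, `default`) and the first-order, one-month-step formulation are assumptions made for this example.

```python
from collections import Counter, defaultdict


def transition_probabilities(histories):
    """Estimate P(next state | current state) from monthly state sequences.

    histories: list of lists of state names, oldest month first.
    """
    counts = defaultdict(Counter)
    for states in histories:
        # Count every observed month-to-month transition.
        for state, next_state in zip(states, states[1:]):
            counts[state][next_state] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }
```

With probabilities estimated from historical loans and cards, a client's current state gives a rough estimate of how likely a move into the `default` state is over the following months.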