Credit Risk Management with SAS Viya 3.5

Tan Jia Yun

Apr 18, 2021

Introduction

In this article, we will explore how SAS Viya can be used to perform the different stages of the analytical lifecycle. The SAS Viya modules used in this article include:

  1. SAS Visual Analytics: Explore data by creating smart visualizations (e.g. charts)
  2. SAS Data Preparation: Perform data transformations, such as joining tables, transposing columns, creating calculated columns, etc.
  3. SAS Visual Data Mining and Machine Learning: Perform data mining, build machine learning models, and analyse results.

Business Problem

During the 2008 global financial crisis, widespread borrower defaults pushed many over-leveraged banks into failure. Since then, banks all over the world have had to abide by the Basel II guidelines to manage their capital for lending.

To have enhanced risk management and risk sensitivity, the bank would like to build a robust credit scoring and credit risk prediction model that will enable them to approve new loans, identify and manage existing customers who are at risk of default, and control the capital the bank has for lending.

Dataset

The raw dataset consists of 26 columns and 3000 observations, of which 1500 observations have a target value of 0 and the remaining 1500 a target value of 1. GB is the target variable, where 0 represents good customers and 1 represents bad customers.

A frequency variable is identified in the dataset, and it is directly tied to the target variable: observations with GB=0 carry a frequency of 30, while observations with GB=1 carry a frequency of 1. Expanding the dataset according to this frequency variable yields 46500 observations in total (1500 × 30 + 1500 × 1).
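As a rough illustration of this expansion outside of SAS Viya, here is a minimal pandas sketch; the file name and the frequency column name FREQ are assumptions, not the actual names in the dataset.

```python
import pandas as pd

# Expand each row according to its frequency weight (assumed column "FREQ"):
# rows with GB=0 are repeated 30 times, rows with GB=1 once.
df = pd.read_csv("credit_raw.csv")  # hypothetical path; 3,000 rows, 26 columns
expanded = df.loc[df.index.repeat(df["FREQ"])].reset_index(drop=True)
assert len(expanded) == 46_500      # 1,500 * 30 + 1,500 * 1
```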

Exploratory Data Analysis (EDA)

In this section, we will be exploring how SAS Viya: Explore & Visualize can be utilized to perform EDA. Several useful insights were generated from EDA.

1. Class Imbalance

The number of good customers (“0”s) far exceeds the number of bad customers (“1”s): 96.7% of the observations belong to GB=0 and the remaining 3.3% to GB=1.

2. Skewed Variables

Within the dataset, there are two heavily skewed categorical variables, LOCATION and RESID, in which more than 90% of observations fall into a single level.

For interval variables, the team has identified two highly-skewed variables, namely, CASH and INCOME.

3. Scale

Interval variables in the dataset are found to be of vastly different scales. For example, AGE, TMJOB1 and TMADD are on a much smaller scale than variables like INCOME and CASH.

4. Correlation

There are three pairs of variables that are highly correlated with each other: CHILDREN and PERS_H, DIV and REGN, and BUREAU and LOANS.

5. Missing Values

Missing values are identified in RESID, PROF and PRODUCT.

6. Outliers

While exploring the data, the team found outliers in INCOME, TMJOB1, TMADD, CHILDREN, PERS_H, LOANS, and CASH, where a small number of applicants have unusually high values that far exceed the other data points.

Data Preprocessing

In this section, we will explore how SAS Viya can be utilized to perform the preprocessing steps easily. The splitting of the comma-delimited string will be performed using Prepare Data, and the remaining steps using Build Models.

1. Splitting Comma-Delimited String

The PRODUCT variable contains comma-delimited data, and thus, the variable will be split into three additional variables (Product 1, Product 2 and Product 3). This step can be performed automatically in SAS Viya: Prepare Data. The original PRODUCT variable will be rejected.
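Outside of Prepare Data, an equivalent split could look like the pandas sketch below; the output column names Product_1 to Product_3 are illustrative, and expanded is the table from the earlier expansion sketch.

```python
# Split the comma-delimited PRODUCT string into three columns and reject
# (drop) the original variable; reindex pads any missing parts with NaN.
parts = expanded["PRODUCT"].str.split(",", expand=True).reindex(columns=range(3))
expanded[["Product_1", "Product_2", "Product_3"]] = parts
expanded = expanded.drop(columns=["PRODUCT"])
```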

2. Data Partitioning

Data is split into training and validation datasets, where 70% of the original data is used for training and the remaining 30% for validation. The training dataset will be used to train the model, while the validation dataset will be used to check for robustness and prevent overfitting of the machine learning model.
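A minimal scikit-learn sketch of the same 70/30 split; stratifying on GB so both partitions keep the class mix, and the random seed, are assumptions rather than documented SAS Viya behaviour.

```python
from sklearn.model_selection import train_test_split

# 70% training / 30% validation, stratified on the target variable GB.
train, valid = train_test_split(
    expanded, test_size=0.30, stratify=expanded["GB"], random_state=42
)
```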

3. Event-based Sampling

To resolve the issue of an imbalanced dataset, the team will perform event-based sampling to reduce the unequal class distribution. Events will be set to 25% of the sample and non-events to 75%. With SAS Viya’s event-based sampling technique, the majority class will be undersampled to adhere to the specified distribution.
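A sketch of the undersampling step, shown here on the full expanded table so the counts line up with the 1500/4500 figures reported below; the seed is arbitrary.

```python
import pandas as pd

# Keep all 1,500 events (GB=1) and undersample non-events (GB=0) to three
# times that number, giving the 25% / 75% event/non-event split.
events = expanded[expanded["GB"] == 1]
non_events = expanded[expanded["GB"] == 0].sample(n=3 * len(events), random_state=42)
balanced = pd.concat([events, non_events]).sample(frac=1, random_state=42)  # shuffle
assert len(balanced) == 6_000
```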

4. Removal of Highly Correlated Variables

Between CHILDREN and PERS_H, CHILDREN will be rejected, as PERS_H may contain more information. While DIV and REGN contain similar information, DIV will be rejected as it is more heavily skewed than REGN. Both BUREAU and LOANS will be kept, as these two variables do not carry the same information and may both contribute to predicting credit default.

5. Removal of Outliers

Outliers are influential observations that may lead to inaccurate models that do not generalize well. Thus, all outliers in the dataset will be removed to ensure they do not affect the model’s accuracy.

6. Transformation

The team will perform a log transformation on the interval variables to reduce the variability in the given data. The log-transformed variables should follow a normal or near-normal distribution.
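A sketch of the transformation, using log1p (log(1 + x)) so zero values stay finite; the variable list is illustrative.

```python
import numpy as np

# Apply log(1 + x) to the interval variables to compress their long tails.
interval_vars = ["INCOME", "CASH", "TMJOB1", "TMADD", "AGE"]  # illustrative list
balanced[interval_vars] = np.log1p(balanced[interval_vars])
```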

7. Standardization

Z-score standardization will be performed to rescale the distribution of interval values so that the observed values have a mean of 0 and a standard deviation of 1.
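For reference, each interval variable x is rescaled using its sample mean μ and standard deviation σ:

```latex
z = \frac{x - \mu}{\sigma}
```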

8. Missing values

We will also impute missing values. The PRODUCT and PROF variables will be imputed with the value “(none)”, signifying that the customer has no recorded profession or did not purchase a product. RESID will not be imputed, as it was removed from the dataset due to high skewness.
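A sketch of the imputation; since PRODUCT has already been split, the “(none)” value is applied to the three derived product columns and to PROF, which is an assumption about where the fill lands.

```python
# Fill missing categorical values with an explicit "(none)" level.
for col in ["Product_1", "Product_2", "Product_3", "PROF"]:
    balanced[col] = balanced[col].fillna("(none)")
```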

The final dataset consists of 6000 observations, with 1500 (25%) bad customers and 4500 (75%) good customers.

Analytical Models and Results

Next, we will build the analytical models and analyse the results.

1. Credit Risk Profiling

In credit risk profiling, SAS Viya’s clustering node will be utilized to build the k-means clustering model. Clustering is an unsupervised learning problem, which means that there will be no target variable to predict.

The two cluster initialization methods, Forgy and Random, will be compared, and the team will use whichever produces the better score. The minimum number of clusters is set to three, and the maximum number of clusters is varied from five to seven.
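A scikit-learn sketch of the sweep; sklearn’s init="random" (random observations as starting centroids) is the closest analogue to Forgy, while SAS Viya’s Random-partition initialization has no direct equivalent here.

```python
from sklearn.cluster import KMeans

# Sweep candidate cluster counts k = 3..7 with Forgy-style initialization.
X = balanced.drop(columns=["GB"]).select_dtypes("number")
for k in range(3, 8):
    km = KMeans(n_clusters=k, init="random", n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)  # within-cluster sum of squares, W_k
```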

The optimal number of clusters k will be selected using the Gap statistic, an evaluation metric that compares the log of the within-cluster sum of squared errors on the observed data against its expected value on reference data.
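Concretely, following Tibshirani et al.’s definition, with W_k the within-cluster sum of squared errors and the expectation taken over reference datasets drawn from a uniform null distribution:

```latex
\mathrm{Gap}(k) = \mathbb{E}^{*}\big[\log W_k^{\mathrm{ref}}\big] - \log W_k
```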

A segment profile node from Viya will then be attached to the champion clustering node, which examines the segmented data and identifies factors that differentiate the segments from the population.

Results

K-means clustering with the Forgy initialization method was chosen as the champion model. The highest Gap statistic, 3.96, was achieved at k=3 clusters.

The three clusters are profiled based on the factors identified by the segment profile node. The team will also be analyzing the default ratio of each cluster, which is calculated using the formula shown below.
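The original formula image is not reproduced here; a standard definition consistent with the reported values is:

```latex
\text{default ratio} = \frac{\text{number of bad customers } (GB = 1) \text{ in cluster}}{\text{total number of customers in cluster}}
```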

Matured Adults (Cluster 1, default ratio = 0.2734)

In this cluster, there is a significantly higher percentage of customers aged between 30 and 40 years old. They have the highest income compared to the other clusters, and a household size of two to four people. They hold a high number of loans outside of the bank, and only half of them have finished paying off their previous loans.

The team concludes that this cluster consists of more established adults who have already started their own families, and who have a relatively high and stable income due to their work experience. They may have taken on additional loans for their children’s education, or for investment purposes.

Students and Fresh Graduates (Cluster 2, default ratio = 0.2652)

Within this cluster, customers are aged between 19 and 25 years old, and they have the lowest income compared to the other clusters. They are generally the only person in their household, and they have a low number of loans within and outside the bank. The majority of them have not finished paying off their loans.

The team deduces that these are young adults who are either still students, or are just starting on their first jobs. Naturally, they would have low income due to lack of work experience, and most of them are not married with families yet. They may have student loans.

Young Adults (Cluster 3, default ratio = 0.2076)

In this cluster, the team sees an age range of 25 to 30 years old, with middle income. These customers have a household size of one to two people, at least one loan outside the bank, and multiple loans within the bank. Most of them have finished paying off their previous loans.

The team deduces that these are financially stable young adults, possibly newlyweds without children. They may have the lowest default ratio because they have no additional financial commitments after paying off their student loans.

2. Credit Risk Prediction

For credit risk prediction, the role of target variable will be assigned to variable GB, with GB=1 being set as the target event level. Additional features from the clustering model, such as distance to centroid, will also be added to the dataset. A baseline model using logistic regression with default settings will be used as a benchmark for comparison of performance between models.
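A minimal sketch of such a baseline outside SAS Viya, assuming a numeric, preprocessed feature matrix from the partitions above; max_iter is raised only to ensure convergence.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Baseline logistic regression benchmark on the preprocessed partitions.
X_train, y_train = train.drop(columns=["GB"]).select_dtypes("number"), train["GB"]
X_valid, y_valid = valid.drop(columns=["GB"]).select_dtypes("number"), valid["GB"]
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("baseline F1:", f1_score(y_valid, baseline.predict(X_valid)))
```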

Feature engineering will be performed to create additional features that may improve the accuracy of the model. The automated feature engineering template from SAS Viya will be utilized to create the new features. The template includes best-transformation, autoencoder, and principal component analysis (PCA) features.

A variety of classifiers available on SAS Viya (gradient boosting, decision tree, forest, support vector machine, and neural network) will be applied to the feature-engineered dataset, and the results will be evaluated against the baseline model. Hyperparameter tuning and autotuning of models will be explored as well.

The F1 score is selected as the evaluation metric because it takes into account both precision and recall.
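For reference:

```latex
F_1 = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```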

Results

The champion model selected by the team is the Gradient Boost model (autotuned) with the best transformation variable from feature engineering. This model achieved the highest F1 score of 0.6632, Area under ROC (AUROC) of 0.8480, and an accuracy of 0.8578.

The two other feature sets created by PCA and the autoencoder in the feature engineering step did not produce good model performance. As PCA relies on linear relationships between variables, the team hypothesized that it failed to improve performance because the dataset may have a nonlinear structure, so nonlinear relationships among the variables cannot be captured by the PCA model. For the autoencoder, the neural network layers may not have been deep and complex enough to capture all the required information, and thus it did not perform well.

3. Credit Scorecard

In credit scoring, the target variable is again GB. Using the Interactive Grouping node in SAS Viya, all of the independent variables (e.g. age, income) will be transformed using the weight of evidence (WoE) method, which seeks a monotonic relationship between each input feature and the target variable by splitting the feature into bins and assigning a weight to each bin. Suppose a WoE transformation on age produces a bin for ages between 23 and 28; then all observations in that bin receive the same WoE value, which can be computed using the formula below:
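The original formula image is not reproduced here; the standard credit-scoring form, where n_i counts customers in bin i and n counts all customers of that class (sign conventions vary between tools), is:

```latex
\mathrm{WoE}_i = \ln\!\left(\frac{n^{\mathrm{good}}_i / n^{\mathrm{good}}}{n^{\mathrm{bad}}_i / n^{\mathrm{bad}}}\right)
```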

This calculation is done automatically in SAS Viya. The WoE transformation also provides the information value (IV) for each variable. IV measures the predictive power of an independent variable, which is useful for feature selection. Strong features typically have an IV above 0.3, weak features below 0.02, and anything above 0.5 is suspiciously high. For this analysis, variables with an IV of less than 0.1 will be dropped.
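IV aggregates the WoE values across all bins of a variable:

```latex
\mathrm{IV} = \sum_{i}\left(\frac{n^{\mathrm{good}}_i}{n^{\mathrm{good}}} - \frac{n^{\mathrm{bad}}_i}{n^{\mathrm{bad}}}\right)\mathrm{WoE}_i
```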

The team will fit the model using the newly transformed WoE dataset. The default modelling technique used by SAS Viya’s Scorecard node is logistic regression. Therefore, logistic regression will be used to predict the likelihood of customers being either good or bad, and the scores will be used to profile them into risk bands. Upon fitting the model, a column of logistic regression coefficients is generated. Since the raw scores are in log odds, they have to be converted into score points through scaling; SAS Viya scales the logistic regression coefficients into score points automatically.

Results

For this scorecard, a cutoff score of 600 points is set. The points will be scaled such that a total score of 600 points corresponds to good/bad odds of 50 to 1, and an increase of 20 points corresponds to a doubling of the good/bad odds. The choice of the cutoff score does not affect the predictive strength of the scorecard. SAS Viya generates a scorecard based on these values.
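These two anchors determine the scaling constants; the arithmetic below follows the standard points-to-double-the-odds (PDO) formulation rather than SAS documentation:

```latex
\mathrm{factor} = \frac{\mathrm{PDO}}{\ln 2} = \frac{20}{\ln 2} \approx 28.85, \qquad
\mathrm{offset} = 600 - \mathrm{factor} \times \ln 50 \approx 487.12,
```
```latex
\mathrm{score} = \mathrm{offset} + \mathrm{factor} \times \ln(\text{good/bad odds})
```

This gives 600 points at odds of 50:1, and each doubling of the odds adds factor × ln 2 = 20 points.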

Looking at the score distribution of the scorecard, customers will be segregated into risk bands based on their credit score: the top 20% of loan applicants will fall under the ‘low risk’ band, the next 20% under ‘medium risk’, the next 20% under ‘high risk’, and the bottom 20% under ‘extremely high risk’.

The model is evaluated using the AUROC score. The scorecard model has an AUROC of 0.6463 on the training set and 0.6117 on the validation set, which indicates that the model has moderate predictive accuracy.

Conclusion

With the aid of the clustering and prediction models, we hope to enhance risk sensitivity and risk management for the bank, not just through prediction scores but through actionable insights for more efficient capital lending. Here are some specific insights the bank can obtain from the results:

Credit Risk Profiling: Banks control the amount of money that each customer is allowed to borrow. For instance, if a client falls within the ‘Matured Adults’ cluster, the cluster with the highest risk of default, the bank can reduce the amount it lends to the customer and apply stricter loan approval criteria and guidelines. The clustering model enables the bank to begin credit risk management right from the start of a customer’s journey, allowing it to customize risk strategies per cluster. This also allows the bank to adjust its risk appetite based on market conditions.

Credit Risk Prediction: Defaulting on loans is common in our society. This model helps the bank identify loan applicants with high risk of default. With this information, the bank can choose to approve or decline the loan for such customers, maximizing profits and minimizing loss. With credit risk prediction, the bank can also better manage their capital requirements based on the Basel II guidelines.

Credit Scorecard: The scorecard provides a plethora of opportunities for banks, especially with regard to lending. It assigns each customer, good or bad, to a risk band, which allows for greater risk management and risk sensitivity for the bank.

A customer’s demographic information and credit history may change over time, which means that customers may also fall in and out of the risk bands as time passes. The scorecard allows the bank to track changes in a customer’s profile easily and effectively, allowing them to adjust their risk levels regularly.

Future Work

Kernel Principal Component Analysis (KPCA), a nonlinear extension to the PCA technique can be utilized during the feature engineering process of the credit risk prediction model. It would be insightful to explore whether this technique would be able to bring about better results compared to the ordinary PCA method.
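A sketch of what that exploration could look like with scikit-learn; the kernel choice, gamma, and component count are illustrative and would need tuning.

```python
from sklearn.decomposition import KernelPCA

# RBF-kernel KPCA as a drop-in replacement for the linear PCA features.
kpca = KernelPCA(n_components=5, kernel="rbf", gamma=0.1)
X_train_kpca = kpca.fit_transform(X_train)  # numeric matrix from the baseline sketch
X_valid_kpca = kpca.transform(X_valid)
```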

In the future, instead of building the scorecard in a pipeline separate from the clustering and credit prediction models, the team could develop a scorecard based on the default probability values obtained from the credit risk prediction model. The scorecard is used as a tool to profile customers into risk bands, which enables the bank to take further action based on which band a loan applicant falls in. This might also yield a more cohesive result.
