Introduction of Credit Card Fraud Detection Datasets
Problem Statement of Credit Card Fraud Detection Datasets
The problem addressed in this paper is credit card fraud detection, which
involves modeling past credit card transactions using the knowledge of which
ones turned out to be fraudulent. The model is then used to decide whether a
new transaction is fraudulent or not. The purpose of the research is to detect
100% of the fraudulent transactions while minimizing incorrect fraud
classifications.
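The two goals stated above, catching every fraud while keeping wrong fraud
flags low, correspond to standard confusion-matrix quantities: recall,
precision, and false positive rate. The following is a minimal Python sketch.
The function name and the counts are illustrative assumptions and are not taken
from the study.

```python
def fraud_metrics(tp, fp, fn, tn):
    """Return (recall, precision, false positive rate) from confusion counts.

    tp: frauds flagged as fraud       fn: frauds that were missed
    fp: genuine flagged as fraud      tn: genuine passed as genuine
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, precision, false_positive_rate


# Hypothetical counts, for illustration only.
recall, precision, fpr = fraud_metrics(tp=90, fp=30, fn=10, tn=9870)
```

Detecting "100% of the fraudulent transactions" means a recall of 1.0, and
"minimizing incorrect classification" means keeping the false positive rate
close to 0.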
Problem of Fraud Detection of Credit Card Fraud Detection Datasets
Fraud is as old as humanity and can be committed in many different ways.
Furthermore, the development of new technologies gives criminals additional
ways to commit it. In e-commerce, for example, the card details alone are
enough to commit fraud. Credit cards are very convenient in modern, busy life,
but the fraud related to them keeps growing every year. The financial losses
caused by such fraud affect not only the users or the banks but also every
individual client. If banks lose money in this way, customers pay as well,
through higher interest rates, higher membership fees, and so on. Fraud can
also damage the reputation of a merchant, causing non-financial losses that
are difficult to quantify in the short term and become visible only in the
long term (Raj et al., 2011).
Fraud can be detected by examining credit card transactions and deciding
whether each newly authorized transaction belongs to the fraudulent class or
is genuine. A Fraud Detection System (FDS) should not only detect fraud cases
efficiently but also be cost-effective, in the sense that the cost of
investigating the screened transactions should stay within a limit and should
not exceed the losses caused by the fraud itself. It has been shown that
screening just 2% of transactions can reduce fraud losses to about 1% of the
total transaction value. Reviewing 30% of transactions could reduce the fraud
losses drastically, to 0.06%, but would increase costs immensely. To minimize
the cost of detection, it is essential to combine expert rules with
statistical models (e.g. machine learning) to make a first screen between
genuine and potentially fraudulent transactions, and then to ask the
investigators to review only the high-risk cases (Juszczak et al., 2008).
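The cost trade-off described above can be made concrete with a small
calculation. The sketch below is illustrative only: the review fractions (2%
and 30%) and residual loss rates (1% and 0.06%) come from the paragraph above,
while the number of transactions, the average transaction value, the cost per
manual review, and the function name are assumptions introduced here.

```python
def total_cost(n_transactions, avg_value, review_fraction,
               residual_loss_rate, cost_per_review):
    """Investigation cost plus remaining fraud loss for one screening policy."""
    review_cost = n_transactions * review_fraction * cost_per_review
    fraud_loss = n_transactions * avg_value * residual_loss_rate
    return review_cost + fraud_loss


# Assumed: 1,000,000 transactions of average value 100, 5 per manual review.
light = total_cost(1_000_000, 100.0, 0.02, 0.01, 5.0)     # review 2%
heavy = total_cost(1_000_000, 100.0, 0.30, 0.0006, 5.0)   # review 30%
```

Under these assumed figures the 30% policy removes most of the fraud loss but
its review cost dominates, so its total cost is higher than that of the 2%
policy. With a cheaper review or larger transactions the comparison could go
the other way.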
Figure: Credit Card Fraud Detection process.
Impact of frauds of Credit Card Fraud Detection Datasets
It is interesting to note that credit card fraud affects the card owner the
least, because the cardholder's liability for the fraudulent transactions is
limited. Existing regulations, cardholder protection policies, and insurance
schemes in many countries protect the interests of cardholders. The parties
most affected are the merchants, who in certain situations have no evidence
(such as a digital signature) with which to dispute the cardholder's claim
that the card information was misused. Merchants have to bear the chargeback
for the losses, the cost of the shipped goods, the card issuer's fees and
charges, and their own administrative costs. When fraud cases involving the
same merchant become excessive, they can drive away customers, cause
card-issuing banks to withdraw their services, and result in a loss of
goodwill and reputation (Quah, 2008).
Card-issuing banks have to bear the administrative cost of investigating fraud
cases, together with the infrastructure cost of the hardware and software
needed to combat fraud. They also incur indirect costs through delays in
transactions. Studies show that the average delay between the date of a
fraudulent transaction and the chargeback notification can be as high as 72
days, which gives fraudsters enough time to cause severe damage.
Credit Card Fraud Detection
Credit card fraud detection relies on the analysis of recorded transactions.
Transaction data mainly consist of a number of attributes (such as the credit
card identifier, the transaction date, the amount, and the recipient).
Automatic systems are essential because it is not always easy for a human to
detect fraud patterns in transaction datasets, which are often characterized
by a large number of samples, many dimensions, and online updates. In
addition, cardholders are not reliable in reporting the theft, loss, or
fraudulent use of their cards. We do not discuss here the relative benefits
and drawbacks of expert-driven and data-driven approaches to fraud detection
(Pavía et al., 2012).
One way to detect fraud is a data-driven method: the FDS is built on machine
learning, either supervised or unsupervised, to learn which patterns are
associated with fraudulent behavior. With machine learning, we let the
computer discover fraud patterns in the available data. This approach has both
advantages and drawbacks. With machine learning algorithms we can:
· Learn complicated fraudulent patterns using all of the available features.
· Ingest large volumes of data.
· Model complex distributions.
· Predict new kinds of fraud.
· Adapt to changing distributions as fraud evolves.
However, this approach also has some disadvantages:
· The algorithms require a sufficient number of samples.
· Some of the models are black boxes: they are not easily interpreted by the
investigators and do not explain why an alert was generated.
Challenges in Fraud Detection
The design of an FDS that employs data-driven methods (DDMs) relies on machine
learning algorithms and is particularly challenging for several reasons.
Frauds represent a very small fraction of all the transactions made each day.
The fraud distribution evolves over time because of seasonality and new attack
methods. The true nature of most transactions is typically known only several
days after the transaction took place, since only a few transactions are
checked by the investigators in a timely manner (Pozzolo et al., 2015).
· The first challenge is known as the unbalanced problem, since the
distribution of the transactions is skewed towards the genuine class. The
distributions of genuine and fraud samples are not only unbalanced but also
overlapping, as shown in the plot over the first two principal components in
the figure below. Many machine learning algorithms are not designed to cope
with both overlapping and unbalanced class distributions (Pozzolo, 2015).
· The second challenge is that changes in fraudulent activity and in customer
behavior are the main causes of non-stationarity in the stream of
transactions. This situation is referred to as concept drift.
· The third challenge is that in a real-world setting it is impossible to
check every transaction. The cost of human labor limits the number of alerts
returned by the FDS that the investigators can validate. Investigators check
the FDS alerts by calling the cardholders and then provide the FDS with
feedback indicating whether the alerts corresponded to genuine or fraudulent
transactions.
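The third challenge, a limited number of alerts that investigators can check,
is often handled by ranking transactions by risk score and alerting only on
the highest-scoring ones. The sketch below illustrates that idea under stated
assumptions: the scores, labels, budget, and function names are hypothetical,
and the study itself does not describe how alerts are selected.

```python
def select_alerts(scores, budget):
    """Return the indices of the `budget` highest-scoring transactions."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


def alert_precision(alert_indices, labels):
    """Fraction of alerted transactions that investigators confirm as fraud."""
    if not alert_indices:
        return 0.0
    return sum(labels[i] for i in alert_indices) / len(alert_indices)


# Hypothetical model scores and investigator feedback (1 = fraud, 0 = genuine).
scores = [0.10, 0.95, 0.40, 0.80, 0.05, 0.70]
labels = [0, 1, 0, 1, 0, 0]
alerts = select_alerts(scores, budget=3)
precision_at_budget = alert_precision(alerts, labels)
```

The investigator feedback on the alerted transactions is the only freshly
labeled data available to the FDS, which is one reason the alert budget
matters for later model updates.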
Literature review of Credit Card Fraud Detection Datasets
According to Panigrahi et al. (2009), in today's electronic society e-commerce
has become one of the most important sales channels for global business. With
the rapid advancement of e-commerce, the use of credit cards for purchases has
increased dramatically. Unfortunately, the fraudulent use of credit cards has
also become an attractive source of revenue for criminals. Credit card fraud
is increasing day by day because of security weaknesses in traditional credit
card processing systems, resulting in the loss of billions of dollars each
year. Fraudsters now use sophisticated methods to perpetrate credit card
fraud. Fraudulent activity worldwide presents unique challenges to banks and
other financial institutions that issue credit cards. In the case of bank
cards such as MasterCard and Visa, a study by the American Bankers Association
in 1996 estimated the gross fraud loss at almost $790 million in 1995. The
major share of the losses from credit card fraud is suffered by the USA alone.
This is not surprising, because 71% of all credit cards are issued in the USA.
In 2005, the total fraud loss in the USA was reported to be $2.7 billion, and
it rose to $3.2 billion in 2007. Another survey of about 160 companies
revealed that online fraud is almost 12 times higher than offline fraud
committed with a stolen physical card. To address this problem, financial
institutions employ a number of fraud prevention tools such as real-time
credit card authorization, card verification codes, rule-based detection, and
more. But fraudsters are adaptive and, given time, devise ways to circumvent
such protection mechanisms. Despite the best efforts of financial
institutions, law enforcement agencies, and governments, credit card fraud
continues to rise (Panigrahi et al., 2009).
Fusion approach using Dempster–Shafer theory
According to Chen et al. (2004), a parallel granular neural network has been
suggested for speeding up the data mining and knowledge discovery process.
Automated credit card fraud detection with artificial neural networks (ANN)
and Bayesian belief networks (BBN) has also been outlined: the BBN gives
better fraud detection results and its training period is shorter, while the
actual detection process is faster with the ANN. Neural-network-based methods
are in general fast but are not considered very accurate. Retraining such
networks is also a major bottleneck, because the training time is high. The
authors propose a novel method in which an online questionnaire is used to
collect questionnaire-responded transaction (QRT) data from users. A support
vector machine (SVM) is trained on these data, and the resulting QRT models
are used to predict new transactions. A more recent personalized approach to
credit card fraud detection uses both SVM and ANN. It helps to prevent fraud
for users even without any transaction data. However, these systems are not
fully automated, and they depend on the expertise level of the users. Other
researchers have applied data mining to credit card fraud detection: large
data sets are divided into small subsets, and distributed data mining is then
applied to build models of user behavior. The possibility of combining
advanced data mining techniques with neural networks to obtain high fraud
coverage with a low false alarm rate has also been explored. Data mining
techniques are accurate, but they are slow (Chen et al., 2004).
Association rules applied to credit card fraud detection
Association rules are among the best-studied models in data mining. In the
article reviewed here, the proposed methodology uses them to extract knowledge
so that normal behavior patterns can be obtained in unlawful transactions from
transactional credit card databases, in order to detect and prevent fraud. The
methodology has been applied to credit card fraud data from a number of major
retail companies. In this respect, the selected sector relies on one of the
most widely used strategies for sustained growth and differentiation in the
industry: winning the loyalty of its clients.
While it is true that the mass issue of credit cards by department stores has
been successful as a marketing project, it is equally true that this has
increased the risk of exposure to illegal activity, as demonstrated by the
growing tendency for fraud highlighted in specialist publications (e.g. the
latest Cybersource report, sponsored by Cybersource Corporation). The
diversification of the client portfolio through the mass issue of credit
cards, together with aggressive marketing plans that encouraged the varied use
of this payment method, has been accompanied by a lack of useful techniques
and intelligent systems for the effective detection and prevention of illegal
use. A number of articles respond to this by offering different ways of
detecting and preventing such behavior. They agree with the observations of
Bhatla in 2002, who showed that the evaluated systems cannot guarantee
effectiveness and that no single technology can eliminate fraud on its own. In
his opinion, a combination of techniques can help in detection and prevention.
The results of the Cybersource survey show that manual control is still one of
the most widely used processes for detecting and preventing fraud (Sánchez et
al., 2009).
Parallel Granular Neural Networks of Credit Card Fraud Detection Datasets
According to Syeda et al. (2002), the technologies used in fraud detection
include neural network models, intelligent decision engines, business models,
expert systems, and meta-learning agents. A context vector provides a means of
encoding textual information in a form that systems can process easily.
Training algorithms assign context vectors to objects in such a way that the
vectors of related objects are close together. Various text processing
problems can be solved with this technology, and combining it with rule-based
systems and neural networks can improve fraud detection performance. JAM is an
extensible agent-based distributed data mining system that supports remote
dispatching. Expert systems can also be used in conjunction with neural
networks to detect fraud. Traditional statistical processes lack the ability
of a neural network to build highly accurate models (Syeda et al., 2002).
Methodology of Credit Card Fraud Detection Datasets
This section describes the materials and methods adopted in this study and the
tools used. The study uses data for fraud forecasting models that come from
real-time transactions. The data are based on the database history together
with authorization information. To a certain extent, inquiry information,
posting transactions, and non-monetary information were also used. The
transaction database has more than 40 fields. Under the terms of a
nondisclosure agreement, the full details of the data set cannot be revealed:
only the schema of the database can be described, not the contents of the
data.
Only a few aspects of the data schema are therefore described in this study.
The data were collected from banks and had already been labeled by the banks
as fraud or non-fraud. About 0.07% of the transactions are fraudulent. Both
kinds of data were used: all of the fraud records, together with a sample of
the non-fraud records, formed the training set. The data were processed as
follows. Records with missing values were omitted. Depending on their
distribution, several transformations were applied to the original variables,
including standardization, log transformation, and discretization, in order to
create derived variables. Feature extraction and selection were then carried
out on the required variables, which yielded the final data set for modeling.
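The preprocessing steps named above (omitting missing values, standardization,
log transformation, and discretization) can be sketched as follows. Because the
bank data are under a nondisclosure agreement, the field names, values, and
bin edges below are invented for illustration, and the helper functions are
assumptions rather than the procedure actually used in the study.

```python
import math
import statistics


def drop_missing(rows):
    """Keep only the records in which no field is missing (None)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def log_transform(values):
    """log(1 + v): compresses a long right tail and is defined at v = 0."""
    return [math.log1p(v) for v in values]


def discretize(values, edges):
    """Bin index of each value: how many of the sorted edges it reaches."""
    return [sum(v >= e for e in edges) for v in values]


# Hypothetical records; the real fields cannot be disclosed.
rows = [
    {"amount": 10.0, "account_age_days": 30},
    {"amount": None, "account_age_days": 5},
    {"amount": 250.0, "account_age_days": 400},
    {"amount": 40.0, "account_age_days": 90},
]
clean = drop_missing(rows)
amounts = [r["amount"] for r in clean]
amount_z = standardize(amounts)
amount_log = log_transform(amounts)
amount_bin = discretize(amounts, edges=[25.0, 100.0])
```

Each derived list (`amount_z`, `amount_log`, `amount_bin`) would become an
additional candidate variable for the later feature selection step.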
Methods of Credit Card Fraud Detection Datasets
Three methods were adopted for detecting fraud: logistic regression, neural
networks, and decision trees.
Decision Tree of Credit Card Fraud Detection Datasets
The decision tree method was developed from the concepts of learning systems.
The best-known early algorithm of this kind is ID3; its successor, C4.5, can
also deal with continuous data. A decision tree separates a complex problem
into several simpler ones using a divide-and-conquer strategy (Shen, 2007),
and resolves the sub-problems by applying the same procedure repeatedly. It is
a data mining method that discovers classification knowledge by constructing a
tree.
The key question for a decision tree model is how to construct a tree that is
small in scale and high in precision. A decision tree is a tree-shaped
structure of nodes joined by connecting lines. Each node is either a branching
node, which is followed by further nodes, or a leaf node, which is labeled
with a classification. Decision tree methods have several advantages. The
first is high flexibility: they are non-parametric methods that make no
assumptions about the distribution of the data. The second is good robustness.
They are also easy to explain, which is one reason they are so widely used.
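ID3 chooses the attribute to split on by information gain, that is, the
reduction in class entropy obtained by partitioning the records on that
attribute. The sketch below computes it for a toy table. The attribute names
and records are invented for illustration and are not taken from the study's
data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after the split."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical categorical records and their classes.
rows = [
    {"channel": "online", "foreign": "yes"},
    {"channel": "online", "foreign": "no"},
    {"channel": "store", "foreign": "yes"},
    {"channel": "store", "foreign": "no"},
]
labels = ["fraud", "fraud", "genuine", "genuine"]

# ID3 splits on the attribute with the highest information gain.
best_attribute = max(rows[0], key=lambda a: information_gain(rows, labels, a))
```

In this toy table the "channel" attribute separates the two classes perfectly,
so it has the highest gain. The same selection is then repeated on each
resulting subset, which is the divide-and-conquer step described above.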
Neural Networks of Credit Card Fraud Detection Datasets
The architectures and topologies of neural networks are formed by organizing
nodes into layers and linking the neurons of these layers with modifiable
weighted interconnections. Recently, neural network researchers have
incorporated methods from statistics and numerical analysis into their
networks.
As a nonlinear mapping from the input space to the output space, a neural
network can learn from given cases and capture the internal principles of the
data without knowing those principles in advance. It can also adapt its
behavior to a new environment, generalizing from the current situation to the
new one. From a purely theoretical standpoint, the nonlinear neural network
method is superior to statistical methods for credit card fraud detection.
In practical research, however, the results are sometimes inconsistent with
these natural advantages, possibly because of an improper network structure or
learning method. Neural networks also still have many disadvantages, such as
the difficulty of determining the structure, the efficiency of training, and
overtraining.
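To make the "layers of nodes joined by weighted connections" description
concrete, the sketch below runs the forward pass of a network with one hidden
layer. It is a minimal illustration only: the layer sizes, the sigmoid
activation, and all weight values are assumptions, since the study does not
report its network structure, and no training step is shown.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> one sigmoid hidden layer -> one sigmoid output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
              for weights, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Arbitrary illustrative weights: 2 inputs, 2 hidden neurons, 1 output.
w_hidden = [[0.5, -0.2], [-0.3, 0.8]]
b_hidden = [0.0, 0.1]
w_out = [1.2, -0.7]
b_out = -0.1
score = forward([1.0, 2.0], w_hidden, b_hidden, w_out, b_out)
```

The output can be read as a fraud score between 0 and 1. Training would adjust
the modifiable weights, and choosing the number of layers and neurons is the
structure problem listed among the disadvantages.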
Logistic Regression of Credit Card Fraud Detection Datasets
An increasing number of models are applied to data mining tasks, including
multiple discriminant analysis, regression analysis, probit methods, and
logistic regression. Logistic regression is useful when one needs to predict
the presence or absence of a characteristic or outcome based on the values of
a set of predictor variables.
It is similar to a linear regression model but is suited to models in which
the dependent variable is dichotomous. The logistic regression coefficients
can be used to estimate odds ratios for each of the independent variables in
the model. It is applicable to a broader range of research situations than
discriminant analysis. Two important models were introduced in the business
failure prediction literature: the multivariate conditional probability model
and the linear probability model. The contribution of these methods lies in
estimating the odds of a firm’s failure as a probability.
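The link between coefficients and odds ratios can be shown in a few lines. In
a logistic regression, the predicted probability is p = 1 / (1 + exp(-z)) with
z = intercept + sum of coefficient times feature, so raising one feature by
one unit multiplies the odds p / (1 - p) by exp(coefficient). The sketch below
checks this numerically. The coefficient values and function names are
illustrative assumptions, not fitted results from the study.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratios(coefficients):
    """Odds ratio of each predictor: exp(b_i) per one-unit increase."""
    return [math.exp(b) for b in coefficients]


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients; the first is chosen so its odds ratio is 2.
coefficients = [math.log(2.0), -0.5]
ratios = odds_ratios(coefficients)

p_base = predict_probability([0.0, 0.0], coefficients, intercept=0.0)
p_plus_one = predict_probability([1.0, 0.0], coefficients, intercept=0.0)
observed_ratio = odds(p_plus_one) / odds(p_base)
```

Here `observed_ratio` matches `ratios[0]`: a one-unit increase in the first
predictor doubles the odds of the positive class while the other predictor is
held fixed.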
Discussion of Credit Card Fraud Detection Datasets
Dataset analysis of Credit Card Fraud Detection Datasets
The problem is taken from Kaggle.
Fraud is a significant problem for credit card companies, both because of the
large volume of transactions completed every day and because many fraudulent
transactions look a lot like normal transactions (Brownlee, 2020).
Observations of Credit Card Fraud Detection Datasets
· The dataset is highly skewed: it contains 492 frauds out of a total of
284,807 observations, which is about 0.172% fraud cases. This skew is
explained by the low number of fraudulent transactions.
· The dataset consists of numerical values from 28 features transformed by
principal component analysis (PCA), named V1 to V28. No metadata about the
original features is provided, so pre-analysis and feature study could not be
done.
· The “Amount” and “Time” features are not transformed.
· The dataset has no missing values (Frei, 2019).
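The skew in the first observation can be checked with simple arithmetic, and
the same numbers show why plain accuracy is a misleading metric here. The
counts below are the ones stated in the list above. The "always predict
genuine" baseline is an illustrative assumption, not a model evaluated in the
study.

```python
n_total = 284_807   # observations in the dataset
n_fraud = 492       # fraudulent transactions

fraud_rate = n_fraud / n_total              # share of frauds, about 0.17%

# A trivial baseline that labels every transaction as genuine:
baseline_accuracy = (n_total - n_fraud) / n_total
baseline_recall = 0 / n_fraud               # it detects none of the frauds
```

Such a baseline is correct on more than 99.8% of the transactions while
catching no fraud at all, which is why recall-oriented measures are preferred
on this dataset.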
Credit card fraud detection uses data science and machine learning algorithms.
The following approaches are currently used (Maniraj, 2019):
· Artificial Neural Network
· Fuzzy Logic
· Genetic Algorithm
· Logistic Regression
· Decision Tree
· Support Vector Machines
· Bayesian Networks
· Hidden Markov Model
· K-Nearest Neighbours
Conclusion of Credit Card Fraud Detection Datasets
Fraud detection is a complex problem that requires a great deal of planning
before machine learning algorithms are thrown at it. It is nevertheless an
application of data science and machine learning for good, one that helps to
keep customers' money safe.
In summary, with the rapid growth of technology, crime and fraud are also
increasing, and credit card fraud is increasing as well. It has become
necessary to address this problem in order to protect the confidential data of
customers. This study introduced and explored a data set for the detection of
credit card fraud and described the information it contains.
This study also described three classification methods used for the deeper
analysis of credit card business history and built models for the detection of
fraud. It demonstrated data mining techniques and their advantages. The
methods comprise decision trees, neural networks, and logistic regression. The
study also offers ways of protecting against various kinds of bank risk. The
results show that the proposed neural network and logistic regression
classifiers outperform the decision tree in solving the problem under
investigation.
Future work of Credit Card Fraud Detection Datasets
Future work will include comprehensive tuning of a Random Forest algorithm.
Having data with non-anonymized features would make this particularly
interesting, since the resulting feature importances would show which specific
factors matter most for detecting fraudulent transactions.
References of Credit Card Fraud Detection Datasets
Benson Edwin Raj, S., & Annie Portia, A. (2011). Analysis on credit card fraud
detection methods. International Conference on Computer, Communication and
Electrical Technology (ICCCET).
Brownlee, J. (2020, March 11). Imbalanced Classification
with the Fraudulent Credit Card Transactions Dataset. Retrieved from
https://machinelearningmastery.com/imbalanced-classification-with-the-fraudulent-credit-card-transactions-dataset/
Chen, R., et al. (2004). Detecting credit card fraud by
using questionnaire-responded transaction model based on support vector
machines. Proceedings of the Fifth International Conference on
Intelligent Data Engineering and Automat, 800–806.
Frei, L. (2019, January 16). Detecting Credit Card
Fraud Using Machine Learning. Retrieved from
https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8
Juszczak, P., Adams, N. M., Hand, D. J., Whitrow, C., & Weston, D. J. (2008).
Off-the-peg and bespoke classifiers for fraud detection. Computational
Statistics & Data Analysis, 52(9), 4521–4532.
Maniraj, S. (2019). Credit Card Fraud Detection using
Machine Learning and Data Science. International Journal of Engineering
Research & Technology.
Panigrahi, S., et al. (2009). Credit card fraud
detection: A fusion approach using Dempster–Shafer theory and Bayesian
learning. Information Fusion, 354–363.
Pavía, J., et al. (2012). Credit card incidents and
control systems. International Journal of Information Management, 32(6),
501–503.
Pozzolo, A. D., et al. (2015). Credit card fraud
detection and concept-drift adaptation with delayed supervised information. In
Neural Networks (IJCNN) International Joint Conference.
Pozzolo, A. D. (2015). Adaptive Machine Learning
for Credit Card Fraud Detection. Université Libre de Bruxelles.
Quah. (2008). Real-time credit card fraud detection
using computational intelligence. Expert Systems with Applications,
1721–1732.
Sánchez, D., et al. (2009). Association rules applied
to credit card fraud detection. Expert Systems with Applications,
3630–3640.
Syeda, M., et al. (2002). Parallel granular neural
networks for fast credit card fraud detection. IEEE International Conference
on Fuzzy Systems.