A credit card transaction dataset is used for exploratory data analysis and to train machine learning models that predict fraudulent transactions. The dataset contains details of each transaction, the card usage, and the card owner. It was provided in JSON format, saved as a text file, and consists of 29 columns and 786,363 rows. The goals are to derive insights such as multiple card swipes and amount reversals, to explore patterns in the transaction amount, and to build a machine learning model that predicts fraud.
The data was stored in HDFS and imported into Python using PySpark. The resulting PySpark DataFrame was then converted to a pandas DataFrame for further modification and analysis.
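The loading step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the text file holds one JSON object per line, and the function names, the application name "fraud-eda", and the sample field names are all hypothetical.

```python
import json

import pandas as pd


def parse_json_lines(lines):
    """Parse an iterable of JSON strings (one record per line) into a pandas DataFrame."""
    # Blank lines are skipped; every other line is assumed to be a complete JSON object.
    records = [json.loads(line) for line in lines if line.strip()]
    return pd.DataFrame.from_records(records)


def load_from_hdfs(path):
    """Read line-delimited JSON from HDFS with PySpark and convert it to pandas."""
    # Imported lazily so parse_json_lines can be used without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fraud-eda").getOrCreate()
    spark_df = spark.read.json(path)  # by default expects one JSON object per line
    return spark_df.toPandas()  # collects all rows onto the driver


if __name__ == "__main__":
    # Hypothetical sample records; the real column names may differ.
    sample = [
        '{"accountNumber": "1", "transactionAmount": 12.5, "isFraud": false}',
        '{"accountNumber": "2", "transactionAmount": 99.0, "isFraud": true}',
    ]
    print(parse_json_lines(sample).head())
```

Note that `toPandas()` pulls the whole dataset into the driver's memory, which is workable for a dataset of this size but would not scale to much larger data.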
Models Used:
Logistic Regression
Gradient Boosting
Random Forest Classifier
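As a rough sketch of how the three models listed above might be trained and compared with scikit-learn: the example below uses a synthetic, imbalanced dataset as a stand-in for the real transactions. The hyperparameters, the `class_weight="balanced"` setting, the stratified split, and the choice of ROC AUC as the metric are all assumptions made for illustration, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, random_state=0):
    """Fit the three classifiers and return their ROC AUC on a held-out split."""
    # Stratify so the rare positive (fraud) class appears in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    models = {
        "Logistic Regression": LogisticRegression(
            max_iter=1000, class_weight="balanced"
        ),
        "Gradient Boosting": GradientBoostingClassifier(
            n_estimators=50, random_state=random_state
        ),
        "Random Forest": RandomForestClassifier(
            n_estimators=50, class_weight="balanced", random_state=random_state
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Score with the predicted probability of the positive class.
        proba = model.predict_proba(X_test)[:, 1]
        scores[name] = roc_auc_score(y_test, proba)
    return scores


# Synthetic stand-in for the transaction features: about 5% positive labels.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
scores = compare_models(X, y)
for name, auc in scores.items():
    print(f"{name}: ROC AUC = {auc:.3f}")
```

Because fraud is typically a rare class, plain accuracy can be misleading; a ranking metric such as ROC AUC (or precision and recall) is usually a more informative basis for comparison.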
IDE: Jupyter
Packages:
pandas, hdfs, pyspark, matplotlib, seaborn, simplejson, numpy, scikit-learn (sklearn)