Arif Ahmad Shehloo

Ganesh Gopal Varshney

Abstract

The exponential growth of diverse digital data continues to pose significant challenges for efficient storage and meaningful analysis. Apache Spark, with its in-memory cluster computing, has become a cornerstone of big data analytics. This study evaluates the analytical performance of Spark's machine learning library (MLlib) by applying classification algorithms to a real-world banking dataset, and it surveys recent advances in big data processing and machine learning. Three models (Logistic Regression, Decision Tree, and Random Forest) were trained on the dataset to predict loan approval outcomes, showcasing MLlib's scalability and processing speed. The study demonstrates that MLlib parallelizes computation and model training efficiently across distributed datasets, making it well suited to large-scale data processing. Recent developments are also examined, including improved integration with deep learning frameworks, enhanced AutoML capabilities, and advances in real-time processing. Performance benchmarks are updated to reflect the latest versions of Spark and MLlib. The findings align with industry trends that indicate increasing adoption of Apache Spark and MLlib by enterprises, particularly in the banking and fintech sectors. Together, these results underscore the continuing relevance of Apache Spark MLlib to real-world applications, especially in domains such as banking that require accurate predictive analytics.
