
Financial Data Cross-Referencing & Big Data Processing
with Python / PySpark

ℹ️ Important Information

Due to strict regulations and data sensitivity in the financial sector, all data used and displayed in this project is entirely fictitious. Column and file names are invented and do not refer to any real individuals or organizations. The purpose of this project is to demonstrate technical expertise and the kind of results achievable in a real-world financial data context.

🎯 Project Goals

The goal is to build a tool that cross-references users' financial information (debts and assets) across several data sources, namely internal debt files and credit files. Initially limited to a pilot geographic area, the project now targets nationwide deployment.

🗂️ Functionality & Data Processed

🛠️ Technical Objectives

🔗 Resources & Tools
Tools used: Python, PySpark, Excel

📋 Solution Workflow

  1. Data Collection and Preparation
  2. Data Merging, Deduplication, and Cross-checking
  3. Enrichment, Cleaning, and Export for Dashboarding
  4. Processing High-Volume Data Using PySpark

1️⃣ Part 1: Data Collection and Preparation

📝 Description

The aim is to gather and prepare all financial data sources (debts and credits) from different internal applications for further cross-referencing.
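
As an illustration of this preparation step, here is a minimal pandas sketch. It assumes two small extracts with different separators, header styles, and decimal conventions. The column names (`user_id`, `name`, `debt_amount`, `credit_amount`), the sample rows, and the `load_source` helper are all hypothetical and only stand in for the real internal files.

```python
import io

import pandas as pd

# Hypothetical extracts standing in for the internal debt and credit files.
DEBTS_CSV = """User ID;Name;Debt Amount
 001 ;alice martin;1200,50
002;Bob Durand;300
"""
CREDITS_CSV = """user_id,name,credit_amount
001,Alice Martin,5000
003,Chloe Petit,750.25
"""


def load_source(raw: str, sep: str, amount_col: str) -> pd.DataFrame:
    """Read one source as text, then normalise headers, keys and amounts."""
    # Read everything as strings so identifiers keep their leading zeros.
    df = pd.read_csv(io.StringIO(raw), sep=sep, dtype=str)
    # Harmonise header style: "User ID" -> "user_id".
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Clean the join key and the display name.
    df["user_id"] = df["user_id"].str.strip()
    df["name"] = df["name"].str.strip().str.title()
    # Accept both "1200,50" and "1200.50" as decimal notations.
    df[amount_col] = pd.to_numeric(df[amount_col].str.replace(",", ".", regex=False))
    return df


debts = load_source(DEBTS_CSV, sep=";", amount_col="debt_amount")
credits = load_source(CREDITS_CSV, sep=",", amount_col="credit_amount")
print(debts)
print(credits)
```

Reading every column as text first is a deliberate choice in this sketch: it keeps identifiers such as `001` intact, and only the amount column is converted to a numeric type afterwards.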

Part1 screenshot

2️⃣ Part 2: Data Merging, Deduplication, and Cross-checking

📝 Description

Cross-reference user information across all sources, identify users present in multiple files, and aggregate key financial data.
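
One possible way to express this step in pandas is sketched below. It assumes both sources share a `user_id` key. The column names and sample values are hypothetical, and the outer merge with `indicator=True` is just one reasonable technique for flagging users present in both files, not necessarily the one used in the project.

```python
import pandas as pd

# Hypothetical prepared sources; the first debt row is duplicated on purpose.
debts = pd.DataFrame({
    "user_id": ["001", "001", "002"],
    "debt_amount": [1200.5, 1200.5, 300.0],
})
credits = pd.DataFrame({
    "user_id": ["001", "003"],
    "credit_amount": [5000.0, 750.25],
})

# 1. Deduplication: drop rows that are exact copies of another row.
debts = debts.drop_duplicates()

# 2. Aggregation: one line per user in each source.
debt_totals = debts.groupby("user_id", as_index=False)["debt_amount"].sum()
credit_totals = credits.groupby("user_id", as_index=False)["credit_amount"].sum()

# 3. Cross-checking: an outer merge keeps every user, and the indicator
#    column records which source(s) each user came from.
crossed = debt_totals.merge(credit_totals, on="user_id", how="outer", indicator=True)
crossed["in_both_sources"] = crossed["_merge"] == "both"
crossed = (
    crossed.drop(columns="_merge")
    .sort_values("user_id")
    .reset_index(drop=True)
)
print(crossed)
```

Users missing from one source keep an empty (NaN) amount for that source, which makes gaps visible instead of silently treating them as zero.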

Part2 screenshot

3️⃣ Part 3: Enrichment, Cleaning, and Export for Dashboarding

📝 Description

Enrich the dataset with administrative reference data, clean and harmonize fields (types, translations), and prepare the file for use in BI tools or national dashboards.
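
The sketch below shows what such an enrichment and cleaning pass could look like in pandas. Everything in it is an assumption made for illustration: the region reference table, the French status labels and their English translations, and the column names.

```python
import io

import pandas as pd

# Hypothetical cross-referenced dataset, with amounts still stored as text.
crossed = pd.DataFrame({
    "user_id": ["001", "002", "003"],
    "code_region": ["11", "84", "99"],
    "statut": ["actif", "clos", "actif"],
    "debt_amount": ["1200.5", "300", None],
})

# Hypothetical administrative reference table (region code -> region name).
regions = pd.DataFrame({
    "code_region": ["11", "84"],
    "region_name": ["Region A", "Region B"],
})

# Hypothetical translation table for a categorical field.
STATUS_TRANSLATION = {"actif": "active", "clos": "closed"}

# Enrichment: a left join keeps every user, even with an unknown region code.
enriched = crossed.merge(regions, on="code_region", how="left")
enriched["region_name"] = enriched["region_name"].fillna("Unknown")

# Cleaning: translate labels and harmonise types.
enriched["status"] = enriched["statut"].map(STATUS_TRANSLATION)
enriched["debt_amount"] = pd.to_numeric(enriched["debt_amount"])
enriched = enriched.drop(columns="statut")

# Export: written to an in-memory buffer here; a real run would pass a
# file path to to_csv so that a BI tool can pick the file up.
buffer = io.StringIO()
enriched.to_csv(buffer, index=False)
print(buffer.getvalue())
```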

Part3 screenshot

4️⃣ Part 4: Processing High-Volume Data Using PySpark

📝 Why PySpark?

The file Transverse-statistical_file.csv is too large to load into memory with plain Python/pandas on a standard machine (RAM limits). This step therefore uses PySpark, a distributed processing framework designed for large datasets.

🎯 Goals
Part4_1 screenshot
Part4_2 screenshot
🧑‍💻 Main Processing Steps

✅ Summary & Key Learnings