Battle of the Databases: Conquering Customer Data Duplicates
Author: Hachiro
Introduction
For my Final Year Project in Data Analytics at Universiti Malaysia Pahang Al-Sultan Abdullah (UMPSA), I tackled a common yet challenging problem in data management: duplicate detection in customer databases. Have you ever dealt with messy customer data where the same person appears multiple times with slight variations—maybe a missing middle name, a typo in the address, or different phone numbers? This was the puzzle I chose to solve.
Under the supervision of Dr. Noryanti Muhammad and industrial guidance from Mohd Izhar Firdaus Ismail (Director of Data & AI, ABYRES HOLDINGS SDN BHD), I explored how different database systems handle this real-world challenge. The project was inspired by my internship experience at Abyres Enterprise Technologies, where I witnessed firsthand how duplicate records affected business operations, from customer management to fraud detection.
Traditional methods often rely on comparing fields like names, emails, or phone numbers—but what happens when the data is incomplete, inconsistent, or just plain wrong? That's when I wondered:
💡 Would a graph database perform better than a relational one in detecting duplicates?
Research Methodology
For my FYP, I developed a comprehensive testing framework, generating synthetic datasets that mimic real-world scenarios using Python's Faker library and the Mockaroo data-generation service. The data was structured into three main components (a minimal generation sketch follows the list):
- Customers (names, emails, phone numbers)
- Transactions (purchases, payment methods)
- Products (items bought, prices)
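To give a feel for the setup, here is a minimal sketch of the kind of generator I used. The field names and the duplicate-injection rate are illustrative assumptions, not the exact project configuration:

```python
import random
from faker import Faker  # pip install faker

fake = Faker()

def make_customers(n, dup_rate=0.05):
    """Generate n synthetic customers, then inject near-duplicates.

    dup_rate is an illustrative assumption: the fraction of records
    re-emitted with small perturbations (a typo, a dropped field).
    """
    customers = [
        {"id": i, "name": fake.name(), "email": fake.email(),
         "phone": fake.phone_number()}
        for i in range(n)
    ]
    for dup_id, source in enumerate(random.sample(customers, int(n * dup_rate)), start=n):
        noisy = dict(source, id=dup_id)
        noisy["name"] = noisy["name"].replace("a", "e", 1)  # simulate a typo
        noisy["phone"] = ""                                  # simulate a missing field
        customers.append(noisy)
    return customers

print(make_customers(10)[:3])
```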
I wanted to evaluate how different database systems handle both exact and approximate duplicate detection. The systems I tested were:
🔹 PostgreSQL – The powerhouse for structured data
🔹 MySQL – A widely used relational database
🔹 Neo4j – A graph database designed for complex relationships
Technical Implementation
The project involved implementing several similarity and community-detection algorithms:
📌 Label Propagation & Louvain
- Used in Neo4j for community detection
- Helped identify clusters of potentially related customer records (a minimal sketch follows)
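On the graph side, community detection ran through Neo4j's Graph Data Science (GDS) library. Here is a minimal sketch using the official Python driver; the projected graph name, the Customer label, and the SIMILAR_TO relationship are assumed names for illustration, not the project's exact schema:

```python
from neo4j import GraphDatabase  # pip install neo4j

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Project customers and their similarity links into an in-memory graph.
    session.run("CALL gds.graph.project('customers', 'Customer', 'SIMILAR_TO')")

    # Louvain groups densely connected records into communities; records
    # sharing a community are duplicate candidates. Label Propagation works
    # the same way via gds.labelPropagation.stream('customers').
    result = session.run(
        "CALL gds.louvain.stream('customers') "
        "YIELD nodeId, communityId "
        "RETURN gds.util.asNode(nodeId).email AS email, communityId "
        "ORDER BY communityId"
    )
    for record in result:
        print(record["email"], record["communityId"])

driver.close()
```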
📌 Jaro-Winkler & Jaccard Similarity
- Implemented in PostgreSQL and MySQL
- Measured string similarity for fuzzy matching of names and addresses (see the sketch below)
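The SQL implementations are engine-specific, but the measures themselves are easy to demonstrate in Python. Here is a minimal sketch using the jellyfish library for Jaro-Winkler plus a token-set Jaccard; the thresholds are illustrative, not the project's tuned values:

```python
import jellyfish  # pip install jellyfish

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on word tokens: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

name_1, name_2 = "Ahmad Faizal bin Ismail", "Ahmad Fizal Ismail"

# Jaro-Winkler rewards matching prefixes, which suits names with typos.
# On older jellyfish versions this function is named jaro_winkler.
jw = jellyfish.jaro_winkler_similarity(name_1, name_2)
jc = jaccard(name_1, name_2)

# 0.85 and 0.5 are illustrative cutoffs, not the project's tuned values.
if jw > 0.85 or jc > 0.5:
    print(f"Possible duplicate: JW={jw:.2f}, Jaccard={jc:.2f}")
```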
📌 Cosine Similarity
- Applied to compare vectors of customer attributes
- Particularly effective for multi-field comparison (see the sketch below)
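To compare whole records rather than single fields, customer attributes can be vectorised and scored with cosine similarity. A minimal scikit-learn sketch, assuming each record is concatenated into one string (an illustrative representation, not my exact feature set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each record concatenates name, email, and phone into one string.
records = [
    "Ahmad Faizal ahmad.faizal@example.com 012-3456789",
    "Ahmad Fizal ahmad.faizal@example.com",
    "Siti Aminah siti.aminah@example.com 019-8765432",
]

# Character n-grams tolerate typos better than whole-word tokens.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(records)
scores = cosine_similarity(vectors)

# High off-diagonal scores flag likely duplicate pairs.
print(scores.round(2))
```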
Research Findings
After months of testing and analysis, here are the key discoveries:
PostgreSQL's Strengths
✔ Superior Scalability: Successfully processed our test dataset of 1 million records in under 2 minutes
✔ Structured Query Performance: 95% accuracy in identifying exact duplicates
✔ Built-in Functions: the pg_trgm extension provided efficient fuzzy matching capabilities (see the sketch below)
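To make the pg_trgm point concrete, here is a minimal sketch of trigram-based fuzzy matching from Python via psycopg2. The table and column names are assumptions, and 0.4 is an illustrative similarity cutoff:

```python
import psycopg2  # pip install psycopg2-binary

# Connection parameters are placeholders.
conn = psycopg2.connect("dbname=customers user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")

# similarity() comes from pg_trgm; a GIN index with gin_trgm_ops
# keeps fuzzy self-joins like this practical on large tables.
cur.execute("""
    SELECT a.id, b.id, similarity(a.name, b.name) AS score
    FROM customers a
    JOIN customers b ON a.id < b.id
    WHERE similarity(a.name, b.name) > 0.4  -- illustrative cutoff
    ORDER BY score DESC;
""")
for row in cur.fetchall():
    print(row)

conn.close()
```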
Neo4j's Advantages
✔ Relationship Analysis: Uncovered 15% more of the complex duplicate patterns than the relational databases did
✔ Pattern Recognition: Identified customer clusters with 88% accuracy
✔ Traversal Speed: 3x faster in exploring interconnected customer relationships
MySQL's Performance
✔ Simple Duplicate Detection: Perfect for datasets under 100,000 records
✔ Resource Efficiency: Used 40% less memory compared to PostgreSQL
✔ Ease of Use: Simpler implementation for basic deduplication needs (see the sketch below)
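As an illustration of the basic deduplication MySQL handles comfortably, a GROUP BY ... HAVING query flags exact duplicates. The schema below is an assumption matching the dataset description above:

```python
import mysql.connector  # pip install mysql-connector-python

# Connection parameters are placeholders.
conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="customers"
)
cur = conn.cursor()

# Exact-duplicate detection: group on the fields that should be unique
# and keep only the groups that appear more than once.
cur.execute("""
    SELECT email, phone, COUNT(*) AS copies
    FROM customers
    GROUP BY email, phone
    HAVING COUNT(*) > 1;
""")
for email, phone, copies in cur.fetchall():
    print(f"{email} / {phone}: {copies} copies")

conn.close()
```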
Academic Impact & Industry Applications
This research contributed to both academic knowledge and practical applications:
Academic Contribution
- Published in UMPSA's research repository
- Presented at the Faculty of Computing's Final Year Project Exhibition
- Presented at Seminar Kebangsaan ISM-XVII dan Mesyuarat Agung 2024 (the ISM-XVII National Seminar and Annual General Meeting) at Universiti Putra Malaysia, 7 March 2024
- Received recognition for innovative approach to data quality management
Industry Relevance
- Developed under the guidance of Abyres Enterprise Technologies' Data Engineering team
- Findings shared with enterprise clients for potential implementation
- Created a framework for database selection based on use case requirements
- Methodology applicable to enterprise data warehouse solutions
Lessons Learned & Recommendations
Through this FYP, I discovered that there's no one-size-fits-all solution in database management. Here are my evidence-based recommendations:
Choose PostgreSQL when you need:
- Fast processing of large datasets (>500,000 records)
- Advanced string matching capabilities
- Enterprise-grade reliability
Go with Neo4j when you have:
- Complex customer relationships
- A need for pattern discovery
- Graph-based analysis requirements
Consider MySQL for:
- Small to medium-sized datasets
- Basic deduplication needs
- Resource-constrained environments
Resources & Further Research
For those interested in building on this research, the full project report is available in UMPSA's research repository.
Acknowledgments
Special thanks to:
- Dr. Noryanti Muhammad (Project Supervisor)
- Mohd Izhar Firdaus Ismail (Industrial Coach, Director of Data & AI at ABYRES HOLDINGS SDN BHD)
- Centre for Mathematical Sciences, UMPSA
- Abyres Enterprise Technologies for industry insights
- My fellow final year students for their feedback and support