Battle of the Databases: Conquering Customer Data Duplicates
Author: Hachiro
Introduction
For my Final Year Project in Data Analytics at Universiti Malaysia Pahang Al-Sultan Abdullah (UMPSA), I tackled a common yet challenging problem in data management: duplicate detection in customer databases. Have you ever dealt with messy customer data where the same person appears multiple times with slight variations—maybe a missing middle name, a typo in the address, or different phone numbers? This was the puzzle I chose to solve.
Under the supervision of Dr. Noryanti Muhammad and industrial guidance from Mohd Izhar Firdaus Ismail (Director of Data & AI, ABYRES HOLDINGS SDN BHD), I explored how different database systems handle this real-world challenge. The project was inspired by my internship experience at Abyres Enterprise Technologies, where I witnessed firsthand how duplicate records affected business operations, from customer management to fraud detection.
Traditional methods often rely on comparing fields like names, emails, or phone numbers—but what happens when the data is incomplete, inconsistent, or just plain wrong? That's when I wondered:
💡 Would a graph database perform better than a relational one in detecting duplicates?
Research Methodology
For my FYP, I developed a comprehensive testing framework, generating synthetic datasets that mimic real-world scenarios using Python's Faker library and the Mockaroo data-generation service. The data was structured into three main components (a minimal generation sketch follows the list):
- Customers (names, emails, phone numbers)
- Transactions (purchases, payment methods)
- Products (items bought, prices)
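To give a feel for the setup, here is a minimal sketch of the kind of generator I used. The field names and the duplicate-injection rate are illustrative assumptions, not the exact project configuration:

```python
import random
from faker import Faker  # pip install faker

fake = Faker()

def make_customers(n, dup_rate=0.05):
    """Generate n synthetic customers, then inject near-duplicates.

    dup_rate is an illustrative assumption: the fraction of records
    re-emitted with small perturbations (a typo, a dropped field).
    """
    customers = [
        {"id": i, "name": fake.name(), "email": fake.email(),
         "phone": fake.phone_number()}
        for i in range(n)
    ]
    for dup_id, source in enumerate(random.sample(customers, int(n * dup_rate)), start=n):
        noisy = dict(source, id=dup_id)
        noisy["name"] = noisy["name"].replace("a", "e", 1)  # simulate a typo
        noisy["phone"] = ""                                  # simulate a missing field
        customers.append(noisy)
    return customers

print(make_customers(10)[:3])
```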
I wanted to evaluate how different database systems handle both exact and approximate duplicate detection. The systems I tested were:
🔹 PostgreSQL – The powerhouse for structured data
🔹 MySQL – A widely used relational database
🔹 Neo4j – A graph database designed for complex relationships
Technical Implementation
The project involved implementing several similarity and community-detection algorithms:
📌 Label Propagation & Louvain
- Used in Neo4j for community detection
- Helped identify clusters of potentially related customer records (a minimal sketch follows)
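On the graph side, community detection ran through Neo4j's Graph Data Science (GDS) library. Here is a minimal sketch using the official Python driver; the projected graph name, the Customer label, and the SIMILAR_TO relationship are assumed names for illustration, not the project's exact schema:

```python
from neo4j import GraphDatabase  # pip install neo4j

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Project customers and their similarity links into an in-memory graph.
    session.run("CALL gds.graph.project('customers', 'Customer', 'SIMILAR_TO')")

    # Louvain groups densely connected records into communities; records
    # sharing a community are duplicate candidates. Label Propagation works
    # the same way via gds.labelPropagation.stream('customers').
    result = session.run(
        "CALL gds.louvain.stream('customers') "
        "YIELD nodeId, communityId "
        "RETURN gds.util.asNode(nodeId).email AS email, communityId "
        "ORDER BY communityId"
    )
    for record in result:
        print(record["email"], record["communityId"])

driver.close()
```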
📌 Jaro-Winkler & Jaccard Similarity
- Implemented in PostgreSQL and MySQL
- Measured string similarity for fuzzy matching of names and addresses (see the sketch below)
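The SQL implementations are engine-specific, but the measures themselves are easy to demonstrate in Python. Here is a minimal sketch using the jellyfish library for Jaro-Winkler plus a token-set Jaccard; the thresholds are illustrative, not the project's tuned values:

```python
import jellyfish  # pip install jellyfish

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on word tokens: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

name_1, name_2 = "Ahmad Faizal bin Ismail", "Ahmad Fizal Ismail"

# Jaro-Winkler rewards matching prefixes, which suits names with typos.
# On older jellyfish versions this function is named jaro_winkler.
jw = jellyfish.jaro_winkler_similarity(name_1, name_2)
jc = jaccard(name_1, name_2)

# 0.85 and 0.5 are illustrative cutoffs, not the project's tuned values.
if jw > 0.85 or jc > 0.5:
    print(f"Possible duplicate: JW={jw:.2f}, Jaccard={jc:.2f}")
```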
📌 Cosine Similarity
- Applied to compare vectors of customer attributes
- Particularly effective for multi-field comparison (see the sketch below)
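To compare whole records rather than single fields, customer attributes can be vectorised and scored with cosine similarity. A minimal scikit-learn sketch, assuming each record is concatenated into one string (an illustrative representation, not my exact feature set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each record concatenates name, email, and phone into one string.
records = [
    "Ahmad Faizal ahmad.faizal@example.com 012-3456789",
    "Ahmad Fizal ahmad.faizal@example.com",
    "Siti Aminah siti.aminah@example.com 019-8765432",
]

# Character n-grams tolerate typos better than whole-word tokens.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(records)
scores = cosine_similarity(vectors)

# High off-diagonal scores flag likely duplicate pairs.
print(scores.round(2))
```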
Research Findings
After months of testing and analysis, here are the key discoveries:
PostgreSQL's Strengths
✔ Superior Scalability: Successfully processed our test dataset of 1 million records in under 2 minutes
✔ Structured Query Performance: 95% accuracy in identifying exact duplicates
✔ Built-in Functions: the pg_trgm extension provided efficient fuzzy matching capabilities (see the sketch below)
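To make the pg_trgm point concrete, here is a minimal sketch of trigram-based fuzzy matching from Python via psycopg2. The table and column names are assumptions, and 0.4 is an illustrative similarity cutoff:

```python
import psycopg2  # pip install psycopg2-binary

# Connection parameters are placeholders.
conn = psycopg2.connect("dbname=customers user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")

# similarity() comes from pg_trgm; a GIN index with gin_trgm_ops
# keeps fuzzy self-joins like this practical on large tables.
cur.execute("""
    SELECT a.id, b.id, similarity(a.name, b.name) AS score
    FROM customers a
    JOIN customers b ON a.id < b.id
    WHERE similarity(a.name, b.name) > 0.4  -- illustrative cutoff
    ORDER BY score DESC;
""")
for row in cur.fetchall():
    print(row)

conn.close()
```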
Neo4j's Advantages
✔ Relationship Analysis: Uncovered 15% more of the complex duplicate patterns than the relational databases did
✔ Pattern Recognition: Identified customer clusters with 88% accuracy
✔ Traversal Speed: 3x faster in exploring interconnected customer relationships
MySQL's Performance
✔ Simple Duplicate Detection: Perfect for datasets under 100,000 records
✔ Resource Efficiency: Used 40% less memory compared to PostgreSQL
✔ Ease of Use: Simpler implementation for basic deduplication needs (see the sketch below)
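As an illustration of the basic deduplication MySQL handles comfortably, a GROUP BY ... HAVING query flags exact duplicates. The schema below is an assumption matching the dataset description above:

```python
import mysql.connector  # pip install mysql-connector-python

# Connection parameters are placeholders.
conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="customers"
)
cur = conn.cursor()

# Exact-duplicate detection: group on the fields that should be unique
# and keep only the groups that appear more than once.
cur.execute("""
    SELECT email, phone, COUNT(*) AS copies
    FROM customers
    GROUP BY email, phone
    HAVING COUNT(*) > 1;
""")
for email, phone, copies in cur.fetchall():
    print(f"{email} / {phone}: {copies} copies")

conn.close()
```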
Academic Impact & Industry Applications
This research contributed to both academic knowledge and practical applications:
Academic Contribution
- Published in UMPSA's research repository
- Presented at the Faculty of Computing's Final Year Project Exhibition
- Presented at Seminar Kebangsaan ISM-XVII dan Mesyuarat Agung 2024 (the ISM-XVII National Seminar and Annual General Meeting) at Universiti Putra Malaysia, 7 March 2024
- Received recognition for innovative approach to data quality management
Industry Relevance
- Developed under the guidance of Abyres Enterprise Technologies' Data Engineering team
- Findings shared with enterprise clients for potential implementation
- Created a framework for database selection based on use case requirements
- Methodology applicable to enterprise data warehouse solutions
Lessons Learned & Recommendations
Through this FYP, I discovered that there's no one-size-fits-all solution in database management. Here are my evidence-based recommendations:
Choose PostgreSQL when you need:
- Fast processing of large datasets (>500,000 records)
- Advanced string matching capabilities
- Enterprise-grade reliability
Go with Neo4j when you have:
- Complex customer relationships
- A need for pattern discovery
- Graph-based analysis requirements
Consider MySQL for:
- Small to medium-sized datasets
- Basic deduplication needs
- Resource-constrained environments
Resources & Further Research
For those interested in building on this research, the full project report is available in UMPSA's research repository.
Acknowledgments
Special thanks to:
- Dr. Noryanti Muhammad (Project Supervisor)
- Mohd Izhar Firdaus Ismail (Industrial Coach, Director of Data & AI at ABYRES HOLDINGS SDN BHD)
- Centre for Mathematical Sciences, UMPSA
- Abyres Enterprise Technologies for industry insights
- My fellow final year students for their feedback and support