
#836: `DoubletsResultFilter.kt`

Location: projectforge-business/src/main/kotlin/org/projectforge/business/address/DoubletsResultFilter.kt
57 lines · 30 code · 23 comments · 4 blank

Kotlin result filter for address deduplication. Implements CustomResultFilter<AddressDO> to detect and extract duplicate address entries (doublets) based on normalized fullname comparison.

Purpose: When an address book grows organically (imports, manual entries, data migration), duplicate contacts accumulate — often with minor variations in formatting but identical normalized names. This filter enables the address list UI to show a "Show doublets only" view, helping administrators find and merge duplicates.

Architecture

Implements CustomResultFilter<AddressDO> — a ProjectForge framework interface for post-query result filtering. Unlike database-side filtering (WHERE clauses), result filters run in-memory on the result set returned by the DAO. This is necessary for deduplication because:

• Duplicates may span different database rows with different IDs
• Normalization logic (getNormalizedFullname) involves string manipulation that can't be expressed in SQL/HQL easily
• The filter needs to accumulate state across multiple elements: match() is called once per element, in order, and each call reads and updates the filter's internal collections
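
To make the post-query contract concrete, here is a minimal sketch of how such a filter might be driven. ResultFilterSketch, applyResultFilter and SeenBeforeFilter are hypothetical names for illustration; the match(list, element) signature is inferred from the walkthrough further down, and the actual CustomResultFilter interface and DAO loop in ProjectForge may differ.

```kotlin
// Hypothetical stand-in for the framework interface. The signature is an
// assumption inferred from the walkthrough: match() sees the result list
// built so far plus the current element.
interface ResultFilterSketch<T> {
    fun match(list: MutableList<T>, element: T): Boolean
}

// Hypothetical driver: feeds rows to the filter in order and keeps the
// ones the filter accepts. The filter may also append to `result` itself.
fun <T> applyResultFilter(rows: List<T>, filter: ResultFilterSketch<T>): List<T> {
    val result = mutableListOf<T>()
    for (row in rows) {
        if (filter.match(result, row)) {
            result.add(row)
        }
    }
    return result
}

// A tiny stateful filter: accepts an element only if it was seen before.
// This kind of cross-row state has no direct WHERE-clause equivalent.
class SeenBeforeFilter : ResultFilterSketch<String> {
    private val seen = mutableSetOf<String>()

    override fun match(list: MutableList<String>, element: String): Boolean =
        !seen.add(element) // add() returns false when the element was already present
}
```

With the input a, b, a, c, b, the sketch returns a, b: only the second occurrences are accepted, which is exactly the gap the doublets filter closes by flushing earlier elements into the list.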

State Machine

The filter is a stateful, single-pass filter with a retroactive flush. It keeps four collections (see Data Structure Choices) and performs up to two steps per call to match():

Accumulation: As each address element is fed to match():
• If the normalized fullname is not yet in fullnames, it's a first occurrence: the element is added to all, its name to fullnames, and match() returns false
• If the normalized fullname is already in fullnames, it's a doublet: the name is added to doubletFullnames and the flush step runs

Flush: When a doublet is detected, all previously seen addresses whose names match any doublet name are added to the result list within the same call. The addedDoublets set prevents adding the same address twice.

Data Structure Choices

Field	Type	Purpose
`fullnames`	`MutableSet<String>` (HashSet)	Tracks all unique normalized names seen so far — O(1) lookup
`doubletFullnames`	`MutableSet<String>` (HashSet)	Tracks names confirmed to have ≥2 occurrences — used for retroactive flushing
`all`	`MutableList<AddressDO>` (ArrayList)	Ordered list of all processed addresses — needed because earlier entries weren't returned as matches
`addedDoublets`	`MutableSet<Long?>` (HashSet)	Deduplication guard — prevents adding the same address ID twice to the result

Algorithm Walkthrough

For each address in the query result (sorted by name):

1. Skip deleted addresses → return false
  
2. Compute normalizedFullname = AddressDao.getNormalizedFullname(element)
   (strip whitespace, lowercase, etc.)

3. Check fullnames.contains(normalizedFullname):
   
   NO (first occurrence):
     all.add(element)
     fullnames.add(normalizedFullname)
     return false  (don't include in results yet)
   
   YES (duplicate detected):
     doubletFullnames.add(normalizedFullname)
     
     // Flush: add all previously-seen addresses sharing this name
     for (adr in all):
       if (addedDoublets.contains(adr.id)) continue  // skip already-added
       if (doubletFullnames.contains(getNormalizedFullname(adr))):
         list.add(adr)                                // inject into result
         addedDoublets.add(adr.id)                    // mark as injected
     
     // The current element (the duplicate) is also returned
     return true
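
The steps above can be sketched in Kotlin as follows. This is an illustration under stated assumptions, not the file's actual source: AddressSketch stands in for AddressDO, normalizedFullname is an assumed trim-and-lowercase normalization (the real AddressDao.getNormalizedFullname may differ), and findDoublets is a hypothetical driver loop.

```kotlin
// Stand-in for AddressDO with only the fields the walkthrough touches.
data class AddressSketch(
    val id: Long?,
    val firstName: String,
    val name: String,
    val deleted: Boolean = false
)

// Assumed normalization; the real AddressDao.getNormalizedFullname may differ.
fun normalizedFullname(a: AddressSketch): String =
    "${a.firstName.trim()} ${a.name.trim()}".lowercase()

class DoubletsFilterSketch {
    private val fullnames = mutableSetOf<String>()        // every normalized name seen
    private val doubletFullnames = mutableSetOf<String>() // names seen at least twice
    private val all = mutableListOf<AddressSketch>()      // first occurrences, in order
    private val addedDoublets = mutableSetOf<Long?>()     // ids already flushed

    fun match(list: MutableList<AddressSketch>, element: AddressSketch): Boolean {
        if (element.deleted) return false                 // step 1
        val fullname = normalizedFullname(element)        // step 2
        if (!fullnames.contains(fullname)) {              // step 3, first occurrence
            all.add(element)
            fullnames.add(fullname)
            return false
        }
        doubletFullnames.add(fullname)                    // step 3, duplicate
        for (adr in all) {
            if (addedDoublets.contains(adr.id)) continue  // already flushed
            if (doubletFullnames.contains(normalizedFullname(adr))) {
                list.add(adr)                             // retroactive flush
                addedDoublets.add(adr.id)
            }
        }
        return true                                       // caller adds the duplicate
    }
}

// Hypothetical driver loop: adds an element when match() returns true.
fun findDoublets(rows: List<AddressSketch>): List<AddressSketch> {
    val filter = DoubletsFilterSketch()
    val result = mutableListOf<AddressSketch>()
    for (row in rows) {
        if (filter.match(result, row)) result.add(row)
    }
    return result
}
```

In this sketch the flushed first occurrence lands in the result immediately before the duplicate that triggered the flush, so entries sharing a name stay adjacent only if the input is ordered by name.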

Git History

868d6abb7 2025 → 2026 (copyright year update)
63081666f Source file headers: 2024→2025
4c04cfd65 MAJOR: Migration of integer id's to Long id's (FK cascades)
067a4cbb1 Migration stuff in progress...
0d183e5df Migration stuff in progress...
b7b459e73 Migration stuff in progress...

The earliest commits listed here date from the int→Long migration period, when result filters were restructured to use Long IDs.

Design Rationale

Why a post-filter rather than a database GROUP BY + HAVING? A SQL GROUP BY normalized_name HAVING COUNT(*) > 1 has two drawbacks. First, the normalization function would have to be expressible in SQL, which is possible but fragile across SQL dialects. Second, the query would return grouped names and counts rather than the exact entity objects needed for the UI's merge dialog. The post-filter approach keeps normalization logic in Java/Kotlin where it's typed, testable, and consistent with the rest of the application.
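
For comparison, the in-memory analogue of GROUP BY ... HAVING COUNT(*) > 1 is short in Kotlin and returns whole objects rather than names and counts. The sketch below is only meant to illustrate that point: Contact and normalize are hypothetical stand-ins, and this batch form needs the complete list up front, unlike the per-element match() call the filter has to implement.

```kotlin
// Hypothetical stand-in for an address entity.
data class Contact(val id: Long, val fullname: String)

// Assumed normalization (trim + lowercase); not the real AddressDao logic.
fun normalize(fullname: String): String = fullname.trim().lowercase()

// Groups contacts by normalized name and keeps only groups with more than
// one member, i.e. the doublets, with the full objects still attached.
fun doubletGroups(contacts: List<Contact>): Map<String, List<Contact>> =
    contacts
        .groupBy { normalize(it.fullname) }
        .filterValues { it.size > 1 }
```

Because the grouping key is computed in application code, the same normalization function can be shared between this kind of batch check and a streaming filter.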

Why the retroactive flush pattern? The match() interface returns true/false per element (include/exclude from results). Elements processed before a doublet is detected were already excluded (returned false). The retroactive flush injects those earlier instances by mutating the result list directly. This is an intentional side-effect — the filter has write access to the list and uses it to compensate for the streaming nature of the match() API.