Duplicate Contacts Manager 2.1.1 Requires Restart
by David von Oheimb
This Thunderbird add-on searches address book(s) for pairs of matching contact entries.
It can automatically delete entries that have equivalent or less information than the other one.
Any remaining matches are presented for manual treatment.
About this Add-on
Please report any issues on https://github.com/DDvO/Duplicate-Contacts-Manager/issues/
After installation, this add-on can be invoked via the 'Tools->Duplicate Contacts Manager...' menu entry. One can also customize the 'Toolbar' of the 'Address Book' window with a 'Find Duplicates' button.
The Duplicate Contacts Manager searches address books for pairs of matching contact entries, also known as cards.
It can automatically delete all cards that have equivalent or less information than some matching one.
Any remaining pairs of matching cards are presented as candidate duplicates for manual treatment.
Each pair of cards is shown side by side with a comparison of all fields containing data, including any photo.
Some important fields are always shown so that they can be filled in if they are still empty.
When pairs of candidate duplicates are presented, various comparison information is given in the column between them.
- The '≡' symbol is shown between non-empty fields with identical values, while non-identical values are highlighted by color.
All other relations are determined after abstraction of values (see the definitions below).
- The '≃' symbol indicates matching names, email addresses, or phone numbers.
- The '≅' symbol indicates equivalent cards, equivalent fields, or equal sets (after abstraction).
- The '⊆' and '⊇' symbols indicate the subset/superset relation on mailing list membership, email addresses, and phone numbers.
- The '⋦' and '⋧' symbols indicate that a field or a whole card contains less/more information than the other.
- The '<' and '>' symbols indicate comparison on numerical values or the substring/superstring relation on names and other texts.
In order to exclude pairs of similar cards from being repeatedly presented for manual treatment, they may be given different AIMScreenNames, such that they are filtered out from the search results.
There are two search modes for finding matching cards:
- within a single address book with n cards, comparing each card with all other cards, resulting in n*(n-1)/2 pairs of cards to compare.
- with two different address books with n and m cards, comparing each card in the first one with each card of the second one, resulting in n*m pairs to compare.
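The pair counts of the two search modes can be checked with a minimal Python sketch. The function names `pairs_within` and `pairs_across` and the sample card labels are made up for illustration and are not part of the add-on.

```python
from itertools import combinations, product


def pairs_within(book):
    """All unordered pairs of distinct cards in a single address book."""
    return list(combinations(book, 2))


def pairs_across(book_a, book_b):
    """All pairs with one card from each of two different address books."""
    return list(product(book_a, book_b))


# Hypothetical address books holding card labels only.
book_a = ["a1", "a2", "a3", "a4"]  # n = 4
book_b = ["b1", "b2", "b3"]        # m = 3

n, m = len(book_a), len(book_b)
assert len(pairs_within(book_a)) == n * (n - 1) // 2  # 6 pairs
assert len(pairs_across(book_a, book_b)) == n * m     # 12 pairs
```

The number of comparisons grows quadratically in the single-book mode, so large address books take noticeably longer to search.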
The matching relation is designed to be rather weak, such that it tends to yield all pairs of potential duplicates.
Two cards are considered matching if any of the following conditions hold, where the details are explained below.
- The cards contain matching names, or
- they contain matching email addresses, or
- they contain matching phone numbers, or
- both cards do not contain any name, email address, or phone number that might match.
Cards with non-equivalent AIMScreenNames are never considered matching.
Matching of names, email addresses, and phone numbers is based upon equivalence and sub-equivalence of fields modulo abstraction, described below. As a result, for example, names differing only in letter case are considered to match.
For the matching process, names are completed and their order is normalized. For example, if two name parts are detected in the DisplayName (e.g., "John Doe") or in an email address (e.g., "[email protected]"), they are taken as first and last name.
Both multiple email addresses within a card and multiple phone numbers within a card are treated as sets, i.e., their order is ignored as well as their types.
- Two cards are considered to have matching names if
  - their DisplayNames consist both of one word or both of more than one word and are sub-equivalent, or
  - both their FirstName and their LastName are not empty and are pairwise sub-equivalent, or
  - their DisplayNames are empty but their FirstName or LastName are not empty and are pairwise sub-equivalent, or
  - in one card the DisplayName is empty and either the FirstName or LastName is not empty and is sub-equivalent to the DisplayName of the other card, or
  - their AIMScreenNames are not empty and sub-equivalent.
- Two cards are considered to contain matching email addresses if any of their PrimaryEmail or SecondEmail are equivalent.
- Two cards are considered to contain matching phone numbers if any of their CellularNumber, WorkPhone, or PagerNumber are equivalent. The HomePhone and FaxNumber fields are not considered for matching because such numbers are often shared.
Before card fields are compared their values are abstracted using the following steps.
- Pruning, which removes stray contents irrelevant for comparison:
  - ignore values of certain field types; the set of ignored fields is configurable, with the default being UID, UUID, CardUID, groupDavKey, groupDavVersion, groupDavVersionPrev, RecordKey, DbRowID, PhotoType, PhotoName, LowercasePrimaryEmail, LowercaseSecondEmail, unprocessed:rev, unprocessed:x-ablabel,
  - remove leading/trailing/multiple whitespace and strip non-digit characters from phone numbers,
  - strip any stray email address duplicates from names, which get inserted by some email clients as default names, and
  - replace '@googlemail.com' by '@gmail.com' in email addresses.
- Transformation, which re-arranges information for better comparison:
  - correct the order of first and last name (for instance, re-order "Doe, John"),
  - move middle initials such as "M" from last name to first name, and
  - move last name prefixes such as "von" from first name to last name.
- Normalization, which equalizes representation variants:
  - convert to lowercase (except for name part of AOL email addresses),
  - convert texts by transcribing umlauts and ligatures, and
  - if configured, replace in phone numbers the international call (IDD) prefix (such as '00') by '+' and the national trunk prefix (such as '0') by the default country calling code (such as '+49').
- Simplification, which strips less relevant information from texts by removing accents and punctuation.
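A few of these abstraction steps can be sketched in Python as follows. This is only an approximation under stated assumptions: the prefix list, the transcription table, the rule that '+' is kept in phone numbers, and all function names are illustrative choices, not the add-on's actual implementation.

```python
import re
import unicodedata

NAME_PREFIXES = {"von", "van", "de"}  # assumed sample of last name prefixes
TRANSCRIPTIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "æ": "ae", "œ": "oe"}


def transform_name(first, last):
    """Transformation: move name prefixes to the last name, middle initials to the first."""
    f, l = first.split(), last.split()
    while len(f) > 1 and f[-1].lower() in NAME_PREFIXES:
        l.insert(0, f.pop())
    while len(l) > 1 and len(l[0].rstrip(".")) == 1:
        f.append(l.pop(0))
    return " ".join(f), " ".join(l)


def abstract_text(value):
    """Pruning, normalization, and simplification of a text value."""
    text = unicodedata.normalize("NFC", " ".join(value.split()))  # prune whitespace
    text = text.lower()                                           # normalize case
    for src, dst in TRANSCRIPTIONS.items():                       # transcribe umlauts, ligatures
        text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)               # split off accents
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text)                           # drop punctuation
    return " ".join(text.split())


def abstract_phone(value, idd_prefix="00", trunk_prefix="0", country_code="+49"):
    """Strip non-digits (keeping '+', an assumption) and normalize the prefixes."""
    number = re.sub(r"[^0-9+]", "", value)
    if number.startswith(idd_prefix):
        number = "+" + number[len(idd_prefix):]
    elif number.startswith(trunk_prefix):
        number = country_code + number[len(trunk_prefix):]
    return number


assert abstract_text("Péte") == abstract_text(" pete ! ") == "pete"
assert abstract_phone("089/1234-5678") == abstract_phone("+49 89 12345678")
assert transform_name("Peter Y van", "Mueller") == ("Peter Y", "van Mueller")
```

The sketch returns new values and leaves its inputs untouched, which mirrors the point that abstraction serves comparison only.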
Parts of names are considered sub-equivalent if their abstracted values are equal or the abstracted value of one of them is a non-empty whole-word substring of the abstracted value of the other.
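Sub-equivalence can be sketched in Python as below. The sketch assumes that the inputs are already abstracted and that "whole-word" means a run of complete whitespace-separated words; both function names are hypothetical.

```python
def is_whole_word_substring(part, whole):
    """True if the words of `part` occur as a contiguous run of words in `whole`."""
    part_words, whole_words = part.split(), whole.split()
    if not part_words:
        return False  # an empty value never counts as a substring here
    k = len(part_words)
    return any(whole_words[i:i + k] == part_words
               for i in range(len(whole_words) - k + 1))


def is_sub_equivalent(a, b):
    """Equal values, or one is a non-empty whole-word substring of the other."""
    return a == b or is_whole_word_substring(a, b) or is_whole_word_substring(b, a)


assert is_sub_equivalent("peter", "hans peter")
assert is_sub_equivalent("peter van mueller", "van mueller")
assert not is_sub_equivalent("pete", "peter")  # not a whole word
assert not is_sub_equivalent("", "peter")      # empty part does not count
```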
Note that the value adaptations mentioned above are computed only for the comparison, i.e., they do not change the actual card fields.
If automatic removal is chosen, a card is removed only if it matches some other card, has equivalent or less information than that card, and is preferred for deletion; for details see below.
When a pair of matching cards is presented for manual inspection, the card flagged by default with red color for removal is the one preferred for deletion.
A card is considered to have equivalent or less information than another card if for each field:
- the field is configured to be ignored or one of PopularityIndex, LastModifiedDate, RecordKey, and DbRowID (which are always ignored here), or else
- the field is equivalent to the corresponding field of the other card, or
- it is a text (e.g., some name, address component, or Notes) and its abstracted value is a substring of the corresponding field value of the other card, or else
- it is treated as a set and the set of abstracted values is a subset of the corresponding set of the other card, or else
- after abstraction it has the default value, i.e., it is empty for text fields or its value is 0 for numerical fields or false for Boolean fields.
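The field-by-field check can be sketched in Python as follows. It assumes cards are dictionaries whose values are already abstracted and typed as `str`, `int`, `bool`, or `set`; the combined `Emails` field and the function names are hypothetical simplifications.

```python
ALWAYS_IGNORED = {"PopularityIndex", "LastModifiedDate", "RecordKey", "DbRowID"}


def field_has_no_more_info(value, other):
    """True if `value` carries equivalent or less information than `other`."""
    if value == other:
        return True                      # equivalent fields
    if isinstance(value, str) and isinstance(other, str):
        return value in other            # text: substring (covers the empty text)
    if isinstance(value, (set, frozenset)) and isinstance(other, (set, frozenset)):
        return value <= other            # set: subset (covers the empty set)
    return not value                     # default value: 0 or False


def has_equivalent_or_less_info(card, other_card, ignored=frozenset()):
    """Check every non-ignored field of `card` against `other_card`."""
    skip = ALWAYS_IGNORED | set(ignored)
    for name, value in card.items():
        if name in skip:
            continue
        other = other_card.get(name, type(value)())  # missing field = default value
        if not field_has_no_more_info(value, other):
            return False
    return True


# Made-up abstracted cards loosely following the example at the end of this page.
left = {"DisplayName": "hans peter van mueller", "PreferMailFormat": 2,
        "Emails": {"p.mueller@example.org", "pm@example.org"}, "PopularityIndex": 5}
right = {"DisplayName": "peter van mueller", "PreferMailFormat": 0,
         "Emails": {"p.mueller@example.org"}, "PopularityIndex": 3}

assert has_equivalent_or_less_info(right, left)
assert not has_equivalent_or_less_info(left, right)
```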
Of two matching cards, one is preferred for deletion if
- it has fewer non-empty fields, or else the number of non-empty fields is equal and
- the character weight of the card is smaller, i.e., its pruned and transformed (non-ignored) textual field and phone number field values have a smaller total number of uppercase letters and special characters than the other card, or else the character weight is equal and
- it is used less frequently (i.e., its PopularityIndex is smaller), or else it has the same usage frequency and
- it is older (i.e., its LastModifiedDate is smaller), or else it has the same age and
- it is found in the second address book if the address books searched are different, or else
- it is found later in the same address book.
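This chain of tie-breakers behaves like a lexicographic comparison, which a short Python sketch can express with tuples. The `CardSummary` class and its fields are hypothetical; in particular, the non-empty field count and the character weight are assumed to be computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class CardSummary:
    non_empty_fields: int
    char_weight: int        # uppercase letters plus special characters, precomputed
    popularity_index: int
    last_modified: int      # timestamp; smaller means older
    book: int               # 0 = first address book, 1 = second address book
    position: int           # index of the card within its address book


def deletion_key(card):
    """Smaller key = preferred for deletion; later criteria only break ties."""
    return (card.non_empty_fields, card.char_weight, card.popularity_index,
            card.last_modified, -card.book, -card.position)


def preferred_for_deletion(a, b):
    """Return the card of the pair that is preferred for deletion."""
    return a if deletion_key(a) < deletion_key(b) else b


a = CardSummary(8, 12, 5, 100, 0, 0)
b = CardSummary(8, 12, 3, 200, 0, 1)   # same size and weight, but used less often
c = CardSummary(8, 12, 5, 100, 0, 4)   # identical to a, but found later

assert preferred_for_deletion(a, b) is b
assert preferred_for_deletion(a, c) is c
```

Negating `book` and `position` makes the card in the second address book, or the later card, compare as smaller.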
Here is an example.
The card on the right will be preferred for deletion because it contains less information.
Each line gives the field name, the value in the left card, the value in the right card, and a remark.
- NickName: "Péte" ... " pete ! " ... accent, punctuation, letter case, and whitespace ignored
- FirstName: "Peter" ... "Peter Y van" ... name prefix "van" moved to last name
- LastName: "Y van Müller" ... "Mueller" ... middle initial "Y" moved to first name, umlauts transcribed
- DisplayName: "Hans Peter van Müller" ... "van Müller, Peter" ... first name moved to the front, name is substring
- PreferDisplayName: 'yes' ... 'yes' ... same value
- AimScreenName: "" ... "" ... same AIM name
- PreferMailFormat: 'HTML' ... 'unknown' ... default ('unknown') considered less information
- PrimaryEmail: "[email protected]" ... "[email protected]" ... emails treated as sets, letter case ignored
- SecondaryEmail: "[email protected]" ... "" ... emails treated as sets, letter case ignored
- WorkPhone: "089/1234-5678" ... "+49 89 12345678" ... trunk prefix and international call prefix normalized and non-digits ignored
- PopularityIndex: 5 ... 3 ... field ignored for information comparison
- LastModifiedDate: 2018-02-25 07:51:28 ... 2018-02-25 08:30:37 ... field ignored for information comparison
- UUID: "" ... "903a61be-64d5-4844-802a" ... field ignored
Technical information: The options/configuration/preferences used by this Thunderbird extension are saved in configuration keys starting with 'extensions.DuplicateContactsManager'; for instance, the list of ignored fields is stored in the variable 'ignoreFields'.