Match Configuration XML Definition
For most users the Matching Configurationuser interface should be sufficient for defining matching rules. There are, however, advanced use cases, where the use of the XML configuration is desired.
Before editing the matching configuration, administrators should review the Matching Enginearchitecture documentation.
Match Configuration
Matches are configured using match configuration XML, the schema for this XML can be located in %INSTALL_DIR%\Schemas and referenced in an XML editor of your choice to obtain auto-complete data.
Target and Classification Thresholds
All match configurations begin with the <MatchConfiguration> element. This element defines the overall matching process you'd like to apply.
The attributes for this element are:
Attribute | Type | Description |
id | xs:string | A unique identifier for this match configuration, used when configuring any triggers which use matching. |
nonmatchThreshold | xs:decimal | The threshold at which a candidate is excluded from consideration as being a match. |
matchThreshold | xs:decimal | The threshold at which a candidate is considered a definite match. |
The next element to be configured is the target resource and triggers. Note that some SanteDB Components don't adhere to these settings, they are for suggestion only.
Here the configuration is indicating that the matching algorithm should be applied to resources of type Patient. The match engine makes no assumption about the target resource, match targets can be resources of type Patient, Person, Organization, Place, UserEntity, DeviceEntity, etc.
Blocking Stage Configuration
The blocking stage is configured using one or more <blocking> elements, blocking elements are structured HDSI format queries which are executed against the configured persistence layer. Blocking can be configured as illustrated:
Here the blocking filter is stating it wants to consider any record in the database with a matching identifier from domain MRN (identifier[MRN].value) matching the input value ($input) and it only wants to consider records which have no detectedIssues (i.e. have not been flagged by the data quality engine as having issues).
The blocking filter will initially request the data layer to load 100 results (after which roundtrips to the database will be required).
Additional Blocking Filters
You can specify an infinite number of blocking filters to be run. Each resultset is either UNION (OR) or INTERSECT (and) with the previous. For example, the following configuration:
Will perform three queries against the database for candidate records:
All patients where identifier MRN matches the $input
UNION all patients where identifier HIN matches the $input value
UNION all patients which were:
Born within 2 years of the input record
Have a given name no more than 2 characters different than the candidate
Live in the same state
Don't have any data quality issues
Scoring Stage Configuration
Scoring is configured in the <scoring> section which applies to one or more attributes from the records loaded during the blocking phase. A simple example provided below instructs the patient's gender to be scored.
Attribute | Type | Description |
property | xs:string | The property path on both objects to consider |
u | xs:decimal | The u probability (the probability the property will agree by pure coincidence) |
m | xs:decimal | The m probability (the probability a matching value indicates a match) |
whenNull | xs:enum | The behavior the engine should take when the value is null in either record A or record B |
required | xs:boolean | Whether the attribute assertions need to evaluate (or be handled by whenNull) for matching to continue. |
Here the configuration is instructing the matching engine to consider genderConcept of record A (new object) and record B (the existing object), and the values must exactly match.
Handling Missing Data
There can occur instances of either the inbound record or the source records from the database which are missing the specified attribute. When this is the case the attribute's whenNull
attribute controls the behavior of the evaluation of that attribute. The behaviors are:
@whenNull | Description |
match | When the value is null in A or B, treat the attribute as a "match" (i.e. assume the missing data matches) |
nonmatch | When the value is null in either A or B, treat the attribute as a non-match (i.e. assume the missing data would not match) |
ignore | When the value is null in A or B, don't evaluate the attribute. (i.e. neither the non-match or match scores should be considered) |
disqualify | When the value is null in A or B disqualify the entire record from consideration (i.e. it doesn't matter what the other attribute scores are, the record is considered not a match) |
Transforming Data
You can instruct the matching classification stage to transform the data on record A and record B prior to evaluating an assertion. This is done with a transform, for example, if we wanted to only compare the week of birth we could use this configuration:
The date_extract is applied to both records and then the assertion of "eq" is applied. The following data transforms are available in SanteDB.
Transform | Applies | Action |
addresspart_extract | Entity Address | Extracts a portion of an address for consideration |
date_difference | Date | Extracts the difference (in weeks, months, years, etc.) between the A record and B record. |
date_extract | Date | Extracts a portion of the date from both records. |
name_alias | Entity Name | Considers any of the configured aliases for the name meeting a particular threshold of relevance (i.e. Will = Bill is stronger than Tess = Theresa) |
abs | Number | Returns the absolute value of a number |
dmetaphone | Text | Returns the double metaphone code (with configured code length) from the input string |
length | Text | Returns the length of text string |
levenshtein | Text | Returns the levenshtein string difference between the A and B values. |
sorensen_dice | Text | Returns the Sorensen Dice coefficient of text content A and B values. |
jaro_winkler | Text | Returns the Jaro-Winkler (with default threshold of 0.7) between A and B values. |
metaphone | Text | Returns the metaphone phonetic code for the attribute |
similarity | Text | Returns a %'age (based on levenshtein difference) of string A to string B (i.e. 80% similar or 90% similar) |
soundex | Text | Extracts the soundex codification of the input values. |
substr | Text | Extracts a portion of the input values |
tokenize | Text | Tokenizes the input string based on splitting characters, the tokenized values can then be transformed independently using any of the transforms listed in this table. |
timespan_extract | TimeSpan | Extracts a portion of a timespan such as minutes, hours, seconds. |
Custom Transforms
Attribute transform algorithms can be implemented by creating a new implementation of either IUnaryDataTransformer or IBinaryDataTransformer.
IUnaryTrasnformer - Is applied to each input independently and the each value is then compared to each other using the assertion logic.
IBinaryTransformer - Is applied to both inputs with a result being passed to the assertion.
The example below illustrates the implementation to the "similarity" transform
Custom transforms can be implemented using any .NET language (C#, Pascal, Iron Python, Visual Basic, etc).
Chaining Transforms
Transforms can be chained, for example, if you want to extract the Given name and then compare levenshtein difference of a Patient's mother's name:
This configuration will consider the Mother's name (relationship[Mother].target.name) and will extract (namepart_extract) the Given name, then apply a levenshtein difference. The result must be less than or equal to 2 (op=lte value=2)
Conditional Evaluation
You can also instruct the matching engine to only run or consider an attribute if another attribute is populated. For example, we may want to consider the patient's City address only if the state matches:
Note: Since the property path for both attributes are the same (address) the scoring engine will consider these two properties related, and will execute them as pairs. For example, if a patient has both a VacationHome and a Home address this configuration will consider VacationHome as one address for both attributes and Home. If using address.component[State].value and address.component[City].value instead then the scoring algorithm would treat these as two different property paths.
Partial Scoring
It is possible to apply a partial score to a match attribute. The understand why this may be beneficial, consider the match attribute configuration based on the Double-Metaphone code of a patient's name:
Under such a configuration, a patient with the name Kimberly may be compared to existing patients as illustrated in the table below:
Name | Gender | DOB | Address |
Kimberleigh Smith | F | 1992-01-12 | 123 Main Street |
Kimberly Smith | F | 1992-01-15 | 321 Brant Street |
Kimber Smith | F | 1992-01-10 | 222 West Street |
When evaluating a match the a attribute configuration will generate the Double Metaphone code of KMPR for all of these records, even though the candidate being considered doesn't match. Here it may be important to configure a partial score based on the levenshtein string difference between the original names and applying the percentage of difference to the calculated match score.
Here, the matching engine would alter the given scores for each of the patients under consideration:
Name | namepart_extract | dmetaphone | assertion | similarity | Score |
Kimberleigh Smith | Kimberleigh | KMPR | PASS | 0.429 | 42% of match score |
Kimberly Smith | Kimberly | KMPR | PASS | 1.0 | 100% of match score |
Kimber Smith | Kimber | KMPR | PASS | 0.715 | 71.5% of match score |
Last updated