Problem
If you are working on service which need to aggregate place names from different sources, like Google Places, Foursquare and local db for example, then you always have a problem not to display same place more then one time. Imagine you have Paolo Pizza at Backer Street 43 and all three data sources return it in search result, then you need to show only one from preferred data source.
Possible solution
Solution for me was a two step algorithm.
If
- Distance between places less then 200 meters
- Their names are similar
Then
It is same Places
With 1 everything is clear as I have lat,long for each place and can calculate distance with simple Haversine algorithmus
double EARTH_RADIUS = 6371 /** * Calculate distance based on Haversine algorithmus * (thanks to http://www.movable-type.co.uk/scripts/latlong.html) * * @param latitudeFrom * @param longitudeFrom * @param latitudeTo * @param longitudeTo * @return distance in Kilometers */ BigDecimal calculateDistance(BigDecimal latitudeFrom, BigDecimal longitudeFrom, BigDecimal latitudeTo, BigDecimal longitudeTo) {
def dLat = Math.toRadians(latitudeFrom - latitudeTo) def dLon = Math.toRadians(longitudeFrom - longitudeTo)
//a = sin²(Δlat/2) + cos(lat1).cos(lat2).sin²(Δlong/2) //distance = 2.EARTH_RADIUS.atan2(√a, √(1−a)) def a = Math.pow(Math.sin(dLat / 2), 2) + Math.cos(Math.toRadians(latitudeFrom)) * Math.cos(Math.toRadians(latitudeTo)) * Math.pow(Math.sin(dLon / 2), 2) return 2 * EARTH_RADIUS * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
}The most interesting was to find the best way to compare place names. After some googling I stopped on three algorithms JaroWinkler, SmithWatermanGotoh, Soundex and lib which support all of them – simmetrics. I made a test to find a best algorithm for my needs
/** * Made test of three word comparison algorithms: * JaroWinkler, SmithWatermanGotoh and Soundex * * Based on test result Soundex is chosen to be used in venue title merging algorithm * It is more usable, so that I can say if similarity > 0.98 then is same word */ void testMergeAlgorithms() { def cafeName1 = "Antonio Cafe" def cafeName2 = "Antonio's Cafe" def cafeName3 = "Antonio Kafe" def cafeName4 = "Antonio hotel" def cafeName5 = "Mary Coffee" def algorithm = new JaroWinkler() assertTrue algorithm.getSimilarity(cafeName1, cafeName2) == 0.9809524f assertTrue algorithm.getSimilarity(cafeName1, cafeName3) == 0.9777778f assertTrue algorithm.getSimilarity(cafeName3, cafeName2) == 0.96031743f assertTrue algorithm.getSimilarity(cafeName1, cafeName4) == 0.92564106f assertTrue algorithm.getSimilarity(cafeName1, cafeName5) == 0.55707073f
algorithm = new SmithWatermanGotoh() assertTrue algorithm.getSimilarity(cafeName1, cafeName2) == 0.9f assertTrue algorithm.getSimilarity(cafeName1, cafeName3) == 0.8666667f assertTrue algorithm.getSimilarity(cafeName3, cafeName2) == 0.76666665f assertTrue algorithm.getSimilarity(cafeName1, cafeName4) == 0.7f assertTrue algorithm.getSimilarity(cafeName1, cafeName5) == 0.3272727f
algorithm = new Soundex() assertTrue algorithm.getSimilarity(cafeName1, cafeName2) == 1.0f assertTrue algorithm.getSimilarity(cafeName1, cafeName3) == 1.0f assertTrue algorithm.getSimilarity(cafeName3, cafeName2) == 1.0f assertTrue algorithm.getSimilarity(cafeName1, cafeName4) == 0.9444444f assertTrue algorithm.getSimilarity(cafeName1, cafeName5) == 0.5555556f assertTrue algorithm.getSimilarity(cafeName2, cafeName5) == 0.5555556f }Just to explain: I want an easy way to see that Antonio Cafe is same to Antonio’s Cafe and same to Antonio Kafe but differs from Antonio hotel and from Mary Coffee.
Soundex algorithm appeared the most appropriate for my needs. And in code I use boundary 0.98 when comparing two words
import uk.ac.shef.wit.simmetrics.similaritymetrics.Soundex
boolean isSamePlace = falseSoundex soundexAlgorithm = new Soundex()if(distanceBetweenVenues(place1, place2) < 200 && soundexAlgorithm.getSimilarity(placeName1, placeName2) > 0.98) { isSamePlace = true}That’s our way of solving such a problem. If you have any other experience with similar problems, please post it in comment.