Problem

If you are working on service which need to aggregate place names from different sources, like Google Places, Foursquare and local db for example, then you always have a problem not to display same place more then one time. Imagine you have Paolo Pizza at Backer Street 43 and all three data sources return it in search result, then you need to show only one from preferred data source.

Possible solution

Solution for me was a two step algorithm.

If

  1. Distance between places less then 200 meters
  2. Their names are similar

Then

       It is same Places

With 1 everything is clear as I have lat,long for each place and can calculate distance with simple Haversine algorithmus

    double EARTH_RADIUS = 6371
    /**
* Calculate distance based on Haversine algorithmus
* (thanks to http://www.movable-type.co.uk/scripts/latlong.html)
*
* @param latitudeFrom
* @param longitudeFrom
* @param latitudeTo
* @param longitudeTo
* @return distance in Kilometers
*/
    BigDecimal calculateDistance(BigDecimal latitudeFrom, BigDecimal longitudeFrom,
                                 BigDecimal latitudeTo, BigDecimal longitudeTo) {

        def dLat = Math.toRadians(latitudeFrom - latitudeTo)
        def dLon = Math.toRadians(longitudeFrom - longitudeTo)

        //a = sin²(Δlat/2) + cos(lat1).cos(lat2).sin²(Δlong/2)
        //distance = 2.EARTH_RADIUS.atan2(√a, √(1−a))
        def a = Math.pow(Math.sin(dLat / 2), 2) +
Math.cos(Math.toRadians(latitudeFrom)) *
Math.cos(Math.toRadians(latitudeTo)) * Math.pow(Math.sin(dLon / 2), 2)
        return 2 * EARTH_RADIUS * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))

    }

The most interesting was to find the best way to compare place names. After some googling I stopped on three algorithms JaroWinkler, SmithWatermanGotoh, Soundex and lib which support all of them – simmetrics. I made a test to find a best algorithm for my needs

/**
* Made test of three word comparison algorithms:
* JaroWinkler, SmithWatermanGotoh and Soundex
*
* Based on test result Soundex is chosen to be used in venue title merging algorithm
* It is more usable, so that I can say if similarity > 0.98 then is same word
*/
    void testMergeAlgorithms() {
        def cafeName1 = "Antonio Cafe"
        def cafeName2 = "Antonio's Cafe"
        def cafeName3 = "Antonio Kafe"
        def cafeName4 = "Antonio hotel"
        def cafeName5 = "Mary Coffee"
        def algorithm = new JaroWinkler()
        assertTrue algorithm.getSimilarity(cafeName1, cafeName2) == 0.9809524f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName3) == 0.9777778f
        assertTrue algorithm.getSimilarity(cafeName3, cafeName2) == 0.96031743f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName4) == 0.92564106f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName5) == 0.55707073f

        algorithm = new SmithWatermanGotoh()
        assertTrue algorithm.getSimilarity(cafeName1, cafeName2) == 0.9f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName3) == 0.8666667f
        assertTrue algorithm.getSimilarity(cafeName3, cafeName2) == 0.76666665f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName4) == 0.7f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName5) == 0.3272727f

        algorithm = new Soundex()
        assertTrue algorithm.getSimilarity(cafeName1, cafeName2) == 1.0f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName3) == 1.0f
        assertTrue algorithm.getSimilarity(cafeName3, cafeName2) == 1.0f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName4) == 0.9444444f
        assertTrue algorithm.getSimilarity(cafeName1, cafeName5) == 0.5555556f
        assertTrue algorithm.getSimilarity(cafeName2, cafeName5) == 0.5555556f
    }

Just to explain: I want an easy way to see that Antonio Cafe is same to Antonio’s Cafe and same to Antonio Kafe but differs from Antonio hotel and from Mary Coffee.

Soundex algorithm appeared the most appropriate for my needs. And in code I use boundary 0.98 when comparing two words

import uk.ac.shef.wit.simmetrics.similaritymetrics.Soundex


boolean isSamePlace = false
Soundex soundexAlgorithm = new Soundex()
if(distanceBetweenVenues(place1, place2) < 200 && soundexAlgorithm.getSimilarity(placeName1, placeName2) > 0.98) {
    isSamePlace = true
}

That’s our way of solving such a problem. If you have any other experience with similar problems, please post it in comment.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

1,565 Spam Comments Blocked so far by Spam Free Wordpress

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

 
© 2012 REID Consulting Suffusion theme by Sayontan Sinha