Need to find strings that contain the same word twice

You can use the Python collections module and an Update Cursor to accomplish this. This method adds a new field and populates it with a 1 if there are any duplicates, otherwise a 0 if there are no duplicates.

import arcpy, collections

shp = r'C:\temp\names.shp'

# Add a field called "check" to store binary data.

arcpy.AddField_management(shp, field_name = "check", field_type = "SHORT")

# Use an Update Cursor to query the table and write to new rows
# 1 = has duplicates
# 0 = no duplicates
with arcpy.da.UpdateCursor(shp, ["last_names", "check"]) as cursor:
    for row in cursor:
        names = row[0].replace("&", "").split() # Clean the string
        counts = collections.Counter(names) #create dictionary to count occurrences of words
        if any(x > 1 for x in list([count for name, count in counts.items()])):
            row[1] = 1
        else:
            row[1] = 0
        cursor.updateRow(row)

enter image description here

What about using re and set and setting a flag ( here 0 and 1) in python- re will extract all the names (last and first) from BENNETT MCCARL & ARNETTE BENNETT without &. For pattern matching re is of highest priority- you can use re how you want.

import re
def sorter(val):
    words = re.findall(r'\w+',val)
    uniques = set(words)
    if len(words)>len(uniques):
        return 1
    else:
        return 0

And call sorter( !N! )

demo

**See how regex grabs words at LIVE DEMO

Note that all of these answers deal the problem supposing that your data is sanitized i.e. have proper space between words but what if your data is something like BENNETTMCCARL&ARNETTEBENNETT then all these would fail. In that case you may need to use Suffix Tree algorithm and fortunately python has some library as here.

Assuming your source data is a FeatureClass/Table in a File GeoDatabase then the following query will select the rows you require:

SUBSTRING(name FROM 1 FOR 7) = 'BENNETT' AND SUBSTRING(name FROM (CHAR_LENGTH(name) - 6) FOR 7) = 'BENNETT

name is the field, I just happened to call it name. The first part is testing the left hand side the second part is testing the right. This query is obviously hard coded to search for BENNETT, if you need to select by other surnames hopefully you can work out what needs changing?

Need to find strings that contain the same word twice

Tags:

Arcgis Desktop

Arcgis 10.3

Attribute Table

Field Calculator

Python Parser

Related

Recent Posts