Need to find strings that contain the same word twice
You can use the Python collections module and an Update Cursor to accomplish this. This method adds a new field and populates it with a 1
if there are any duplicates, otherwise a 0
if there are no duplicates.
import arcpy, collections
shp = r'C:\temp\names.shp'
# Add a field called "check" to store binary data.
arcpy.AddField_management(shp, field_name = "check", field_type = "SHORT")
# Use an Update Cursor to query the table and write to new rows
# 1 = has duplicates
# 0 = no duplicates
with arcpy.da.UpdateCursor(shp, ["last_names", "check"]) as cursor:
for row in cursor:
names = row[0].replace("&", "").split() # Clean the string
counts = collections.Counter(names) #create dictionary to count occurrences of words
if any(x > 1 for x in list([count for name, count in counts.items()])):
row[1] = 1
else:
row[1] = 0
cursor.updateRow(row)
What about using re
and set
and setting a flag ( here 0 and 1) in python- re
will extract all the names (last and first) from BENNETT MCCARL & ARNETTE BENNETT
without &
. For pattern matching re
is of highest priority- you can use re
how you want.
import re
def sorter(val):
words = re.findall(r'\w+',val)
uniques = set(words)
if len(words)>len(uniques):
return 1
else:
return 0
And call sorter( !N! )
**See how regex
grabs words at LIVE DEMO
Note that all of these answers deal the problem supposing that your data is sanitized i.e. have proper space between words but what if your data is something like BENNETTMCCARL&ARNETTEBENNETT
then all these would fail. In that case you may need to use Suffix Tree algorithm and fortunately python has some library as here.
Assuming your source data is a FeatureClass/Table in a File GeoDatabase then the following query will select the rows you require:
SUBSTRING(name FROM 1 FOR 7) = 'BENNETT' AND SUBSTRING(name FROM (CHAR_LENGTH(name) - 6) FOR 7) = 'BENNETT
name
is the field, I just happened to call it name
. The first part is testing the left hand side the second part is testing the right. This query is obviously hard coded to search for BENNETT, if you need to select by other surnames hopefully you can work out what needs changing?