Aggregating Polygons To Meet Privacy Requirements
For anyone who's curious, I came up with a solution on my own, using PySAL's region.Maxp algorithm. Essentially, Max-p allows me to generate a set of regions that meets my first criterion (minimum number of employers per region), and I put that inside a while loop, which will reject any of Max-p's solutions that don't also satisfy the second criterion (percentage of employment contributed by the largest employer in a region). I've implemented it as an ArcGIS tool.
I decided to scrap the work I'd done previously to flag blocks/blockgroups/tracts and instead run Max-p on blocks (although I've been doing all my testing on tracts, as a modest increase in the number of input polygons has a dramatic effect on processing time). The relevant part of my code follows. The "shapefile" required as input for the generate_regions() function (passed as a string containing the full path of a shapefile) is one that has already had the employer point features spatially joined to it, with the number of employers, the maximum number of employees at a single employer, and the total number of employees stored as attributes of each input feature (a rough sketch of that preprocessing step appears after the code).
import arcpy, math, pysal, random
import numpy as np
# Suppression criteria:
MIN_EMP_CT = 3 # Minimum number of employers per polygon feature
MAX_EMP_FRAC = 0.8 # Maximum ratio of employees working for a single employer per polygon feature
def generate_regions(shapefile, min_emp_ct=MIN_EMP_CT, max_emp_frac=MAX_EMP_FRAC):
    '''Use pysal's region.Maxp method to generate regions that meet suppression criteria.'''
    w = pysal.rook_from_shapefile(shapefile, idVariable='GEOID10')
    dbf = pysal.open(shapefile[:-4] + '.dbf')
    ids = np.array((dbf.by_col['GEOID10']))
    # The three fieldname variables below hold the names of the spatially joined
    # attribute fields (employer count, max employees from a single employer,
    # total employees) and are defined elsewhere in the tool.
    vars = np.array((dbf.by_col[employer_count_fieldname],
                     dbf.by_col[max_employees_fieldname],
                     dbf.by_col[total_employees_fieldname]))
    employers = vars[0]  # Employer count per polygon, used as Max-p's floor variable
    vars = vars.transpose()
    vars_dict = {}
    for i in range(len(ids)):
        vars_dict[ids[i]] = [int(vars[i][0]), float(vars[i][1]), float(vars[i][2])]
    random.seed(100)     # Using non-random seeds ensures repeatability of results
    np.random.seed(100)  # Using non-random seeds ensures repeatability of results
    # Number of failed iterations after which to increment the minimum number of
    # employers per region (otherwise we could be stuck in the loop literally forever).
    bump_iter = int(arcpy.GetParameterAsText(3))
    iteration = 0
    tests_failed = 1
    while tests_failed:
        floor = int(min_emp_ct + math.floor(iteration / bump_iter))
        solution = pysal.region.Maxp(w, vars, floor, employers)
        regions_failed = 0
        for region in solution.regions:
            SUM_emp10sum = 0
            MAX_emp10max = 0
            for geo in region:
                emp10max = vars_dict[geo][1]
                emp10sum = vars_dict[geo][2]
                SUM_emp10sum += emp10sum
                MAX_emp10max = max(MAX_emp10max, emp10max)
            if SUM_emp10sum > 0:
                ratio = MAX_emp10max / SUM_emp10sum
            else:
                ratio = 1
            if ratio >= max_emp_frac:
                regions_failed += 1
        iteration += 1
        if regions_failed == 0:
            arcpy.AddMessage('Iteration ' + str(iteration) + ' (MIN_EMP_CT = ' + str(floor) + ') - PASSED!')
            tests_failed = 0
        else:
            arcpy.AddMessage('Iteration ' + str(iteration) + ' (MIN_EMP_CT = ' + str(floor) + ') - failed...')
    return solution

# spatially_joined_shapefile and conversion_table are tool parameters defined elsewhere.
solution = generate_regions(spatially_joined_shapefile)
regions = solution.regions

### Write input-to-region conversion table to a CSV file.
csv = open(conversion_table, 'w')
csv.write('"GEOID10","REGION_ID"\n')
for i in range(len(regions)):
    for geo in regions[i]:
        csv.write('"' + geo + '","' + str(i + 1) + '"\n')
csv.close()
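One way to do the spatial-join preprocessing described above is sketched below, using a one-to-many spatial join followed by summary statistics. The dataset paths and the EMP10 employee-count field are placeholders, so adapt them to your own data:

import arcpy

blocks_fc = 'blocks'        # Polygon features (blocks/tracts) with a GEOID10 field
employers_fc = 'employers'  # Employer points with an employee-count field, e.g. 'EMP10'
joined_fc = 'in_memory/blocks_employers'
stats_table = 'in_memory/block_emp_stats'

# One-to-many spatial join: one output row per (block, employer) pair.
arcpy.SpatialJoin_analysis(blocks_fc, employers_fc, joined_fc,
                           'JOIN_ONE_TO_MANY', 'KEEP_ALL')

# Per-block summaries: number of employers (COUNT), largest single employer (MAX),
# and total employment (SUM), grouped by the block identifier.
arcpy.Statistics_analysis(joined_fc, stats_table,
                          [['EMP10', 'COUNT'], ['EMP10', 'MAX'], ['EMP10', 'SUM']],
                          'GEOID10')

# Attach the summary fields to the block polygons so that generate_regions()
# can read them from the shapefile's DBF. Blocks containing no employers end up
# with null MAX/SUM values and should be zero-filled afterwards.
arcpy.JoinField_management(blocks_fc, 'GEOID10', stats_table, 'GEOID10',
                           ['COUNT_EMP10', 'MAX_EMP10', 'SUM_EMP10'])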
I have never come across a situation like this, and I believe a more common route is to keep whatever units you decide on a priori and then use different techniques to "fudge" the data to address privacy concerns.
For an introduction to the myriad ways people mask data, I would suggest this article:
Matthews, Gregory J. & Ofer Harel. 2011. Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy. Statistics Surveys 5: 1-29. The PDF is freely available from Project Euclid at the above link.
I also have links to various other articles that discuss "geomasking" under that tag in my CiteULike library (though not all are strictly related to geographic data).
Although this doesn't answer your question, it is possible that some of the techniques listed in the Matthews and Harel article may be easier to implement to meet your needs. In particular, generating synthetic data seems like a logical extension of where you were going (the external data would be borrowed from the surrounding census block group, tract, or county if necessary). Some types of data swapping (in space) may also be easier to implement; a toy sketch follows.
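As a very rough illustration (not a method taken from the article, and not something to use as-is for real disclosure control), swapping data in space could look like the sketch below, which exchanges an attribute value between a random sample of polygons and one of their rook-contiguous neighbours. The function name, the values dictionary, and the swap rate are all hypothetical:

import random
import pysal

def swap_with_neighbors(shapefile, id_field, values, swap_rate=0.1, seed=100):
    '''Crude spatial data swapping: exchange the attribute values of a random
    sample of polygons with a randomly chosen rook-contiguous neighbour.
    `values` maps polygon id -> attribute value (e.g. total employment).'''
    random.seed(seed)  # Fixed seed for repeatability, as in the tool above
    w = pysal.rook_from_shapefile(shapefile, idVariable=id_field)
    swapped = dict(values)
    sample = random.sample(list(w.neighbors), int(swap_rate * w.n))
    for geo in sample:
        nbrs = w.neighbors[geo]
        if not nbrs:
            continue  # an island has no neighbour to swap with
        other = random.choice(nbrs)
        swapped[geo], swapped[other] = swapped[other], swapped[geo]
    return swapped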