EMRFS file sync with S3 not working
It turned out that I needed to run
emrfs delete s3://bucket/folder
before running sync. Running the delete first solved the issue.
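For reference, the two-command fix looked like the following. This is a minimal sketch; s3://bucket/folder is a placeholder for your own path.

# drop the stale metadata entries for the path (placeholder path)
emrfs delete s3://bucket/folder
# re-sync the metadata store with what is actually in S3
emrfs sync s3://bucket/folder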
Most consistency problems like this come from the retry logic in Spark and Hadoop. A process starts creating a file on S3 and fails, but the corresponding entry has already been written to DynamoDB. When Hadoop retries the operation, it finds the entry already present in DynamoDB and throws a consistency error.
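If you want to confirm that stale entries exist, you can inspect the EMRFS metadata table directly with the AWS CLI. This is a sketch that assumes the default metadata table name EmrFSMetadata; substitute the name your cluster is actually configured with if it differs.

# check that the metadata table exists and see its item count (default table name assumed)
aws dynamodb describe-table --table-name EmrFSMetadata
# sample a few of the stored metadata entries
aws dynamodb scan --table-name EmrFSMetadata --max-items 10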
If you want to delete the S3 metadata stored in DynamoDB for objects that have already been removed from S3, these are the steps (a combined sketch follows after the diff step below).

Delete all the metadata.
emrfs delete removes all the metadata entries under the path. Because it uses a hash function to match the records, it may delete wanted entries as well, which is why we run import and sync in the subsequent steps.
emrfs delete s3://path
Retrieve the metadata for the objects that are physically present in S3 back into DynamoDB.
emrfs import s3://path
Sync the state between S3 and the metadata store.
emrfs sync s3://path
After all the operations, check whether the objects under the path are consistent between S3 and the metadata store:
emrfs diff s3://path
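Putting it all together, the whole repair for one path can be scripted as below. This is a minimal sketch rather than an official procedure: s3://path is a placeholder, and if your cluster uses a non-default metadata table, the emrfs commands take a -m METADATA_NAME option per the CLI reference linked below.

#!/bin/bash
# Sketch: repair EMRFS metadata for a single path (placeholder path).
set -e
TARGET="s3://path"

# 1. Delete all metadata entries for the path. Records are matched by hash,
#    so wanted entries may be removed too; steps 2-3 restore them.
emrfs delete "$TARGET"

# 2. Import metadata for the objects that physically exist in S3.
emrfs import "$TARGET"

# 3. Sync so S3 and the metadata store agree.
emrfs sync "$TARGET"

# 4. Verify: diff should report no differences between S3 and the metadata.
emrfs diff "$TARGET"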
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-cli-reference.html