Hashing a credit card number for use as a fingerprint

Ok, so if this is a business requirement that you must meet (check to see if a card already exists in the system) this is how I would do it.

First, PCI-DSS does allow for a PAN (the primary account number, or credit card number) to be stored in hashed form using "strong cryptography" and they recommend (but do not require) that a salt be used as well. PCI-DSS v3 § 3.4(PDF link) The reason is to inhibit the ability of a malicious party who gets ahold of the hash from determining the PAN that was used as the input for the hash.

Second, your plan to store this in a separate, secure system and expose it only via an API is a good defense in depth measure. If I were implementing this, I wouldn't have it send a hash back to the primary system, I would simply take the hash as a parameter for the API call and have the API return "true" or "false" depending on whether the hash already exists in the system or not.

For the hashing function, the key to preventing the recovery of the inputs (as required by PCI-DSS) is to chose an algorithm that allows you to make the process computationally hard, so the task of brute-forcing the relatively small input space of valid PANs reasonably slow. SHA-512, suggested by another answer, is not this algorithm. Looking at the test matrix for hash cracking provided by hashcat.net the fastest machine in their matrix can calculate nearly 2 billion SHA-512 hashes per second. With an estimated input space of approximately 30,000,000,000,000 cards**, that means that unsalted, you could compute the SHA-512 of each of those PANs in just over 4 hours, and have the card numbers for ever card in the database, should it ever be stolen.

So, what we need is a slow hash function, however the definition of slow changes all the time as computers get faster and technology evolved. This means that a secure hash function will be one with an adjustable work factor, so over time you can increase the computational effort required to compute the hash of a given input. There are several candidate algorithms that meet this criteria including PBKDF2, bcrypt, and scrypt. The general consenus for password hashing is bcrypt, and that is what I would suggest here. For the cost, set it as high as your application allows. It likely isn't something you're going to be computing terribly frequently (I'd imagine, depending on your application, you're going to be looking at x hashes per minute or hour or day rather than per second) and the higher the cost, the harder it will be for an adversary who captures the hash to brute force.

** My estimate is based on a 9 digit account number, a computable check digit that doesn't increase complexity and a WAG of 30,000 active bank identification numbers (the first 6 digits of the card) based on a list I found.

I cannot comment on how Stripe does this but I can tell you exactly how Braintree does it (because that is where I work). If I had to guess, Stripe probably uses a similar method.

In the Braintree API, we offer a unique number identifier for a credit card. This identifier is a random opaque token that will always be the same for card number stored in our system. The seed for this number is different per merchant in our system so you cannot compare them across merchants.

When a new card comes in, we look it up by comparing it to a hashed + salted column. If it matches that existing column we know we can return the same unique number identifier. If it doesn't match any existing record, we use a cryptographically secure pseudo-random number generator to create a new unique number identifier and ensure it doesn't conflict with an existing one.

This way the hashed + salted value never leaves our backend but we can still provide a way for a merchant to uniquely identify stored credit cards.

Hashing credit card numbers is not a substitute for securing the data. If your system isn't secure enough to store raw credit card numbers then it's not secure enough to store CC hashes.

Same thing for anything that's a fixed size and limited character set: date of birth, phone number, zip, etc.

Credit cards are all sixteen digits and while 10^16 seems like a big number, it's pretty straightforward for compute resources. If I have your table of CC# hashes, I can easily attack that with a rainbow table and get a full list of credit cards. Salting the values before you hash makes searching impossible.

All possible CC values: something less than 10^16.

For comparison, imagine your users used a reasonable 8 character password with a 20-byte salt: 62^28.

(If I was your PCI auditor I would consider hashing credit cards to be tantamount to storing the card numbers themselves.)