ESP32 ADC not good enough for audio/music?
I agree with Justme's answer that the DNL/INL is quite high, and also call your attention to this sentence:
By default, there are ±6% differences in measured results between chips.
This matters less for audio applications where the DC level can be normalised later, but for other applications it will definitely require calibration.
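As a rough illustration of normalising the DC level afterwards, here is a minimal sketch that assumes the samples are already available as a list of raw ADC counts. The function name `remove_dc` and the sample values are made up for the example.

```python
def remove_dc(samples):
    """Subtract the mean so the signal is centred on zero.

    The mean absorbs a chip-to-chip DC offset; it does not
    correct gain error or nonlinearity.
    """
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


# Hypothetical raw 12-bit readings riding on a DC bias (values invented).
raw = [1990, 2010, 2030, 2010, 1990, 1970, 1990, 2010]
centred = remove_dc(raw)
print(centred)
```

On a real stream you would more likely use a running average or a high-pass filter than a block mean, but the idea is the same.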
There is also a general concern for any ADC on the same chip as digital circuitry: power supply rejection ratio (PSRR). This is the ability to avoid conducting noise from the power supply into the analog results. Since no mention is made of using an external reference voltage, or separate analog power/ground, or a PSRR number, I suspect this is very bad.
What that is likely to mean is that every power burst from the radio side gets conducted straight into the audio.
Ultimately the easiest thing to do is try it; if it's not up to your perceived quality standards, you'll want an external ADC with its own little linear regulator and separate area of the PCB. But the built-in ADC is probably adequate for phone-quality speech or music played through tinny cheap speakers.
Audio is about more than just sampling rate and bits. Other parameters are significant as well, such as linearity, monotonicity, noise, distortion, etc.
The ESP32 ADC has a DNL of +/- 7 counts. DNL describes how far each code step deviates from its ideal width of one count, so the spacing between adjacent codes can be off by that amount. With a figure this far beyond +/- 1 count, the ADC may have missing codes and may not be monotonic.
The ADC measurements are also performed with a 100nF capacitor filtering the signal, and the signal being measured is DC.
So while it could be used to sample audio, it would require a lot of analog signal conditioning to filter and buffer the signal so that it drives the ESP32 input with low enough impedance, and perhaps oversampling and signal processing as well, to get acceptable quality audio that still cannot reach 12 bits.
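To put a rough number on what oversampling could buy, here is a small sketch of the usual rule of thumb: averaging OSR samples improves SNR by 10*log10(OSR) dB, which is about half a bit per doubling of the sample rate. This assumes the noise is white and uncorrelated between samples. That is my assumption, not something stated for the ESP32, and averaging does nothing for DNL/INL errors. The function name is made up.

```python
import math


def oversampling_gain(osr):
    """Return (extra_snr_db, extra_bits) for an oversampling ratio `osr`.

    Assumes white noise that is uncorrelated from sample to sample;
    linearity errors (DNL/INL) are not reduced by averaging.
    """
    if osr < 1:
        raise ValueError("oversampling ratio must be >= 1")
    extra_db = 10 * math.log10(osr)
    extra_bits = 0.5 * math.log2(osr)
    return extra_db, extra_bits


for osr in (4, 16, 64):
    db, bits = oversampling_gain(osr)
    print(f"OSR={osr:3d}: +{db:.1f} dB, +{bits:.1f} bits")
```

So even 64x oversampling would add only about 3 bits under ideal conditions, which is part of why an external audio ADC is the simpler route.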
So it would be far simpler to just connect a simple audio ADC chip to the I2S bus and it would easily exceed CD quality in terms of bits, linearity, SNR and sampling rate.
Dynamic Range expressed as Effective Number of Bits (ENOB) is another way of expressing Signal to Noise Ratio (SNR):
ENOB: SNR = 6.02*N + 1.76 [dB]
So with resolution
N=24 SNR=146dB
N=20 SNR=122dB
N=16 SNR=98dB
N=14 SNR=86dB
N=12 SNR=74dB
N=8 SNR=50dB
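The table above can be reproduced directly from the formula. This is just a short sketch; the function name `ideal_snr_db` is mine.

```python
def ideal_snr_db(n_bits):
    """Ideal SNR in dB of an N-bit converter limited only by quantization noise."""
    return 6.02 * n_bits + 1.76


# Print the same resolutions as listed in the table, rounded to whole dB.
for n in (24, 20, 16, 14, 12, 8):
    print(f"N={n} SNR={ideal_snr_db(n):.0f}dB")
```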
We often use this in characterizing the noise floor of an ADC system, where N is the actual number of bits and ENOB is the "effective" number of bits. A 16-bit system with 92dB SNR from various noise sources is effectively comparable to a 15-bit system with noise only from quantization error. There are other noise sources besides quantization error; we express the result as an effective number of bits because quantization error is the one noise source that we can never get rid of.
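Going the other way, from a measured SNR to an effective number of bits, is just the same formula solved for N. A minimal sketch (the function name `enob` is mine):

```python
def enob(snr_db):
    """Effective number of bits for a measured SNR in dB."""
    return (snr_db - 1.76) / 6.02


# The 16-bit system with 92 dB SNR from the example above.
print(f"{enob(92.0):.2f}")
```

For the 92dB example this comes out at just under 15 bits.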
While dB is a general-purpose ratio unit, for audio applications it is related to the Sound Pressure Level or loudness of a sound. At 16-bit resolution, the ratio between the loudest expressible signal and the "noise level" of the inherent quantization noise would be 98dB -- painfully loud. So 16 bits is enough resolution to capture good quality audio, at least in terms of dynamic range.
However, a 12-bit resolution ADC system has at best a signal-to-noise ratio of only 74dB, so while it would be able to capture sound at some level, the background hiss will be noticeable. For telephone it might be acceptable, but for music the background hiss would be objectionable.
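If you want to check these figures rather than take the formula on trust, the sketch below quantizes a full-scale sine wave with an ideal N-bit quantizer and measures the resulting SNR. It models quantization error only, with none of the DNL/INL or supply noise a real ESP32 would add, so treat it as the best case. The function name and the sample and cycle counts are arbitrary choices of mine.

```python
import math


def measured_snr_db(n_bits, n_samples=20000, cycles=997):
    """Quantize a full-scale sine to n_bits and return the resulting SNR in dB."""
    step = 2.0 / (2 ** n_bits)          # full scale spans -1.0 .. +1.0
    signal_power = 0.0
    noise_power = 0.0
    for k in range(n_samples):
        x = math.sin(2 * math.pi * cycles * k / n_samples)
        xq = round(x / step) * step     # ideal quantizer, no other noise
        signal_power += x * x
        noise_power += (xq - x) ** 2
    return 10 * math.log10(signal_power / noise_power)


for n in (12, 16):
    print(f"{n}-bit: {measured_snr_db(n):.1f} dB")
```

The results should land close to the 74dB and 98dB figures above.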