Is there an efficient way to synchronise audio events with LEDs in real time using an MCU?
I am not fond of encoding command data as an analog signal in a digital file.
I would try encoding the lighting commands as text in the lyrics section of the ID3 tag inside the MP3 file.
An ID3v2 tag, lyrics included, is normally stored before the sound data, so you should be able to decode it quickly before you start playback.
Typical MP3 playback libraries don't seem to read every ID3 tag; some read only a subset. You could either extract the lyrics tags in your own read function before playback, or extend one of the existing libraries.
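If you write your own read function, the ID3v2 layout is simple enough to walk by hand: a 10-byte header ("ID3", version, flags, and a 4-byte "syncsafe" size with 7 useful bits per byte), followed by frames that each have a 10-byte header (4-character ID, size, flags). Frame sizes are plain big-endian in ID3v2.3 and syncsafe in ID3v2.4. Below is a minimal sketch, assuming the whole tag is already in RAM; `id3_find_frame` is a hypothetical helper name, and unsynchronisation, extended headers, and compressed frames are deliberately not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a 4-byte syncsafe integer: 7 useful bits per byte, MSB first. */
static uint32_t syncsafe32(const uint8_t *p)
{
    return ((uint32_t)(p[0] & 0x7F) << 21) | ((uint32_t)(p[1] & 0x7F) << 14) |
           ((uint32_t)(p[2] & 0x7F) << 7)  |  (uint32_t)(p[3] & 0x7F);
}

/* Plain 32-bit big-endian integer (ID3v2.3 frame sizes). */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: search an ID3v2.3/2.4 tag at the start of buf for the
 * first frame whose 4-character ID matches id (e.g. "SYLT" or "USLT").
 * On success stores the frame body and its length and returns 1, else 0. */
int id3_find_frame(const uint8_t *buf, size_t len, const char *id,
                   const uint8_t **body, uint32_t *body_len)
{
    if (len < 10 || memcmp(buf, "ID3", 3) != 0)
        return 0;
    uint8_t version = buf[3];               /* 3 = ID3v2.3, 4 = ID3v2.4 */
    if (version != 3 && version != 4)
        return 0;
    if (buf[5] & 0xC0)                      /* unsynchronised or extended header: */
        return 0;                           /* not supported in this sketch */

    size_t tag_end = 10 + (size_t)syncsafe32(buf + 6);  /* audio data starts here */
    if (tag_end > len)
        tag_end = len;

    size_t pos = 10;
    while (pos + 10 <= tag_end) {
        const uint8_t *f = buf + pos;
        if (f[0] == 0)                      /* reached the padding area */
            break;
        uint32_t size = (version == 4) ? syncsafe32(f + 4) : be32(f + 4);
        if (size > tag_end - pos - 10)      /* corrupt or truncated frame */
            break;
        if (memcmp(f, id, 4) == 0) {
            *body = f + 10;
            *body_len = size;
            return 1;
        }
        pos += 10 + size;                   /* skip header + body */
    }
    return 0;
}
```

The returned `tag_end` offset computation also tells you where the MP3 frames begin, which is where playback would start.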
Inside the lyrics tags, you have timestamps and text.
You can encode the timestamps as the ID3 standard defines them, or encode your own, more precise timestamps in your own format (depending on which lyrics format you use, the standard timestamps may only resolve to whole seconds).
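As an illustration of "your own format", here is a minimal sketch assuming a hypothetical line syntax `@<milliseconds> <command>`, for example `@83250 DMX1:FFFF00000000` to run a command 83.25 s into the track. The syntax and the function name `parse_timed_line` are made up for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical private format: "@<milliseconds> <command>".
 * On success returns 1, stores the time and points *cmd at the command text. */
int parse_timed_line(const char *line, uint32_t *time_ms, const char **cmd)
{
    /* Require '@' followed directly by a decimal digit (no sign, no spaces). */
    if (line[0] != '@' || line[1] < '0' || line[1] > '9')
        return 0;
    char *end;
    unsigned long t = strtoul(line + 1, &end, 10);
    if (t > UINT32_MAX || *end != ' ')      /* out of range or no separator */
        return 0;
    while (*end == ' ')                     /* skip separator spaces */
        end++;
    if (*end == '\0')                       /* timestamp with no command */
        return 0;
    *time_ms = (uint32_t)t;
    *cmd = end;
    return 1;
}
```

Millisecond timestamps in a 32-bit value cover far longer than any track, so there is no need for a larger type.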
The text content is the interesting part. Define your own text encoding for your lights. For example, "DMX1:FFFF00000000" could mean full-brightness red light on address 1 (that is just DMX data encoded as hexadecimal, with the address included in the header).
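A parser for the "DMX1:FFFF00000000" style of command could look like the sketch below. It treats everything after the colon as raw channel bytes starting at the given address; what those bytes mean (dimmer, red, and so on) depends on the fixture's channel layout. The `dmx_cmd` struct, `parse_dmx_cmd`, and the 64-byte limit are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DMX_MAX_BYTES 64   /* arbitrary per-command limit for this sketch */

typedef struct {
    unsigned address;              /* DMX start address from the header */
    size_t count;                  /* number of channel bytes that follow */
    uint8_t data[DMX_MAX_BYTES];   /* raw channel values */
} dmx_cmd;

static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Parse "DMX<address>:<hex bytes>". Returns 1 on success, 0 if malformed. */
int parse_dmx_cmd(const char *s, dmx_cmd *out)
{
    if (s[0] != 'D' || s[1] != 'M' || s[2] != 'X')
        return 0;
    s += 3;
    if (*s < '0' || *s > '9')
        return 0;
    char *end;
    unsigned long addr = strtoul(s, &end, 10);
    if (*end != ':' || addr < 1 || addr > 512)   /* one DMX512 universe */
        return 0;
    s = end + 1;

    size_t n = 0;
    while (*s != '\0') {
        int hi = hex_nibble(s[0]);
        int lo = (hi < 0) ? -1 : hex_nibble(s[1]);  /* '\0' gives -1 */
        if (hi < 0 || lo < 0 || n == DMX_MAX_BYTES)
            return 0;                               /* odd digit count or bad char */
        out->data[n++] = (uint8_t)((hi << 4) | lo);
        s += 2;
    }
    if (n == 0 || addr + n - 1 > 512)               /* must fit in the universe */
        return 0;
    out->address = (unsigned)addr;
    out->count = n;
    return 1;
}
```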
Or use something simpler if you just need to switch a handful of LEDs connected directly to the microcontroller on and off.
Or implement your format such that you can use it for DMX, but have an interpreter in your controller for local LEDs.
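One way to build that interpreter is to expose the local LEDs as "virtual DMX" channels, so the same decoded command drives either real DMX or the board's own outputs. This is a minimal sketch assuming a hypothetical board with four PWM-driven LEDs mapped to channels 1 to 4; `apply_channels_locally`, `led_duty`, and `led_is_on` are illustrative names, and the PWM timer that would read `led_duty` is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical board: 4 LEDs on PWM outputs, mapped to DMX channels 1..4. */
#define LOCAL_FIRST_CH 1
#define LOCAL_NUM_LEDS 4

/* Duty cycles (0..255) that a PWM timer/ISR would copy to the outputs. */
static volatile uint8_t led_duty[LOCAL_NUM_LEDS];

/* Apply channel values starting at DMX address addr. Channels that land on a
 * local LED update its duty cycle; the rest are ignored here (a DMX-capable
 * build would forward them to the DMX transmitter instead). */
void apply_channels_locally(unsigned addr, const uint8_t *vals, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        unsigned ch = addr + (unsigned)i;
        if (ch >= LOCAL_FIRST_CH && ch < LOCAL_FIRST_CH + LOCAL_NUM_LEDS)
            led_duty[ch - LOCAL_FIRST_CH] = vals[i];
    }
}

/* For plain on/off LEDs, threshold the channel value instead of using PWM. */
int led_is_on(size_t led)
{
    return led_duty[led] >= 128;
}
```

With this mapping, the "DMX1:FFFF00000000" example would turn the first two local LEDs fully on and the other two off.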
The content is really up to you. In any case, it is much more flexible than embedding sounds in one channel of your audio file.
There are lyrics editors that you can use to put your commands into the MP3 files. Just type the commands in your private format as the lyrics.
Using an auxiliary DSP for decompression may make this difficult unless you limit yourself to constant-bitrate files. If you need high timing accuracy for the light events, you may have to account for the processing delay between putting the data in and the sound coming out, and that delay can differ between bitrates.
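A rough way to see why the delay depends on bitrate: if the decoder holds a fixed number of compressed bytes before output (an assumption; check your decoder's datasheet), that buffer represents less playing time at higher bitrates. A 2048-byte buffer drains in 2048 × 8 / 128000 s = 128 ms at 128 kbit/s but only about 51 ms at 320 kbit/s. The sketch below computes that and shifts a light event accordingly; both function names are hypothetical, and the MCU clock is assumed to start when the first compressed byte is sent to the decoder.

```c
#include <stdint.h>

/* Assumption: the decoder buffers a fixed number of compressed bytes, so
 * its latency is the time that buffer takes to play at the given bitrate. */
uint32_t buffer_latency_ms(uint32_t buffer_bytes, uint32_t bitrate_bps)
{
    /* bytes * 8 bits / (bits per second) = seconds; * 1000 for ms.
     * 64-bit intermediate avoids overflow for large buffers. */
    return (uint32_t)(((uint64_t)buffer_bytes * 8u * 1000u) / bitrate_bps);
}

/* With the MCU clock started when feeding begins, sound for audio position
 * event_ms comes out roughly latency_ms later, so fire the light then. */
uint32_t light_fire_time_ms(uint32_t event_ms, uint32_t latency_ms)
{
    return event_ms + latency_ms;
}
```

In practice you would measure the real offset once per bitrate you allow, which is another argument for sticking to constant-bitrate files.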
You could possibly work around a separate decoder by doing the timing independently: start an MCU time counter when audio output begins, and trigger light events at the appropriate timestamps. In that case you may want to store your light data in its own file linked by a naming pattern, or embed it in interleaved data that arrives a little ahead of the compressed audio it corresponds to and is held in an MCU buffer until the indicated timestamp.
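The "hold in a buffer until the timestamp" part can be a small ring buffer of pending events, polled against the millisecond counter that was zeroed when audio output started. This is a minimal sketch under that assumption; `light_event`, `event_push`, `event_poll`, the 16-bit command code, and the 32-entry size are all illustrative choices.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 32   /* arbitrary queue size for this sketch */

typedef struct {
    uint32_t time_ms;   /* when to fire, relative to start of audio output */
    uint16_t cmd;       /* hypothetical command code for the light driver */
} light_event;

static light_event queue[MAX_EVENTS];
static size_t q_head, q_count;   /* ring buffer of pending events, in time order */

/* Call as light data arrives (ahead of the audio it belongs to). Events must
 * be pushed in non-decreasing time order. Returns 0 if the queue is full. */
int event_push(uint32_t time_ms, uint16_t cmd)
{
    if (q_count == MAX_EVENTS)
        return 0;
    queue[(q_head + q_count) % MAX_EVENTS] = (light_event){time_ms, cmd};
    q_count++;
    return 1;
}

/* Call from the main loop or a 1 ms timer tick with the current counter value.
 * Copies up to max_due due commands into due[] and returns how many. */
size_t event_poll(uint32_t now_ms, uint16_t *due, size_t max_due)
{
    size_t n = 0;
    while (q_count > 0 && n < max_due && now_ms >= queue[q_head].time_ms) {
        due[n++] = queue[q_head].cmd;
        q_head = (q_head + 1) % MAX_EVENTS;
        q_count--;
    }
    return n;
}
```

Returning the due commands instead of executing them keeps the poll cheap enough to call from a timer interrupt, with the actual LED or DMX output done in the main loop.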
One potentially drastic simplification is to store linear PCM .wav files instead of compressed MP3. An audio CD holds only about 3/4 of a gigabyte, so even a cheap SD card can hold a few hours of uncompressed audio. Without compression it's pretty simple for your MCU to just clock the data out to a DAC, preferably using a hardware-timer-driven DAC (and possibly DMA), or at least a timer interrupt, rather than a software delay loop.
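For scale, CD-quality audio (44,100 samples/s, 16-bit, stereo) takes 176,400 bytes per second, about 635 MB per hour. The interrupt-driven approach can be sketched as a double buffer: a sample-rate timer interrupt plays one half while the main loop refills the other from the WAV data chunk. This sketch assumes 16-bit signed mono PCM and a 12-bit DAC; `dac_value` stands in for the DAC data register, and the SD card refill code is not shown. A useful side effect is that the sample counter is an exact playback clock, so light events can be keyed to it with no decoder latency to compensate for.

```c
#include <stddef.h>
#include <stdint.h>

#define HALF_BUF 256   /* samples per half buffer (arbitrary) */

/* The ISR plays one half while the main loop refills the other half from the
 * WAV file whenever refill_needed is set (refill code not shown). */
static int16_t pcm_buf[2][HALF_BUF];
static volatile uint8_t playing_half;      /* half the ISR is reading */
static volatile uint8_t refill_needed;     /* set when a half has been drained */
static volatile size_t play_pos;           /* index within the playing half */
static volatile uint32_t samples_played;   /* doubles as the playback clock */
static volatile uint16_t dac_value;        /* stand-in for a 12-bit DAC register */

/* Body of the timer interrupt that fires once per sample (e.g. 44.1 kHz). */
void sample_timer_isr(void)
{
    int16_t s = pcm_buf[playing_half][play_pos];
    /* Signed 16-bit (-32768..32767) to unsigned 12-bit (0..4095). */
    dac_value = (uint16_t)((s + 32768) >> 4);
    samples_played++;
    if (++play_pos == HALF_BUF) {          /* this half is done: swap halves */
        play_pos = 0;
        playing_half ^= 1;
        refill_needed = 1;
    }
}

/* Playback position in milliseconds, derived from the sample counter. */
uint32_t playback_ms(uint32_t sample_rate)
{
    return (uint32_t)((uint64_t)samples_played * 1000u / sample_rate);
}
```

On MCUs with timer-triggered DAC conversions and DMA, the same double-buffer scheme applies, with the DMA half-transfer and transfer-complete interrupts taking the place of this per-sample ISR.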
I once developed an application in which arbitrary events could be triggered by audio watermarks embedded in the sound. You could play the sound through tinny unamplified computer speakers at one end of the conference room table, and my demo box sitting at the other end would turn on LEDs at exactly the correct moments.
It was efficient in the sense that the decoder ran on a tiny 8-bit (6502-based) microcontroller, supported by a simple analog signal processing chain (mic preamp, filter, etc.)