What is bit banging?
Bit banging is creating the whole series of pulses in software, instead of relying on a piece of hardware inside the microcontroller.
Many microcontrollers have a hardware SPI, and then all you have to do is write a byte to the output register, and the SPI controller will shift the data out, and at the same time receive data from the slave. You can get an interrupt when the transfer is complete, and then read the received data.
But some microcontrollers don't have the SPI hardware on board, and then you have to simulate it by doing everything manually. SPI has a number of different modes; I'll use this pulse diagram as an example:
So while a dedicated SPI controller takes care of all the pulses, data shifting and timing, when bit-banging you have to take every action yourself:
Make Slave Select low
Short delay
Do 8 times:
    Make the SCK (Serial Clock) pin low
    Make the MOSI (Master-Out-Slave-In) pin high or low depending on bit 7 of the data
    Add a brief delay
    Make the SCK output high
    Read the MISO (Master-In-Slave-Out) pin
    Shift the received data left, and shift the bit just read in as bit 0
    Add a brief delay
    Shift the data byte 1 bit left
Make Slave Select high again
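The steps above can be sketched in C. The pin functions below are hypothetical stand-ins for what would be port register writes on a real microcontroller; to make the sketch runnable on a PC, they drive a tiny simulated slave that samples MOSI on the rising clock edge and shifts on the falling edge:

```c
#include <stdint.h>

/* Hypothetical GPIO layer. On a real part these would be register writes;
   here they drive a simulated mode-0 slave so the sketch runs on a PC. */
static uint8_t slave_shift = 0xC3;   /* byte the simulated slave sends back */
static int mosi_level, sck_level, pending_bit;

static void set_ss(int level)   { (void)level; }         /* Slave Select  */
static void set_mosi(int level) { mosi_level = level; }
static int  read_miso(void)     { return (slave_shift >> 7) & 1; }

static void set_sck(int level)
{
    if (level && !sck_level)         /* rising edge: slave samples MOSI   */
        pending_bit = mosi_level;
    if (!level && sck_level)         /* falling edge: slave shifts        */
        slave_shift = (uint8_t)((slave_shift << 1) | pending_bit);
    sck_level = level;
}

/* The bit-banged transfer, following the steps above. */
static uint8_t spi_transfer(uint8_t out)
{
    uint8_t in = 0;
    set_ss(0);                                   /* Slave Select low      */
    for (int i = 0; i < 8; i++) {
        set_sck(0);                              /* clock low             */
        set_mosi((out >> 7) & 1);                /* present bit 7 on MOSI */
        /* brief delay would go here on real hardware */
        set_sck(1);                              /* clock high            */
        in = (uint8_t)((in << 1) | read_miso()); /* sample MISO           */
        out = (uint8_t)(out << 1);               /* next data bit         */
    }
    set_sck(0);                                  /* clock back to idle    */
    set_ss(1);                                   /* Slave Select high     */
    return in;
}
```

Calling `spi_transfer(0x5A)` returns 0xC3 (the simulated slave's byte) while the slave ends up holding 0x5A: the same full-duplex exchange a hardware SPI controller would do for you.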
Bit-banging SPI is relatively simple; the code for bit-banging I2C, for instance, will be more complex, and you'll need some kind of timer if you want to bit-bang the UART protocol.
Bit-banging refers to the concept of having the signals which go out of or come into a device be generated/sampled by software rather than hardware. Obviously some hardware is required, but when using bit-banging, the only hardware for each output is a latch which can be explicitly set or cleared by software, and the only hardware for each input is an interface to allow software to test whether it is high or low (and typically execute a conditional branch for one state but not the other).
The maximum speed which can be achieved with bit-banging will generally be a fraction of what could be achieved with purpose-built hardware, but outside the limitations imposed by processor speed, bit-banging is much more versatile, and may be used in circumstances where general-purpose hardware is not quite suitable and special-purpose hardware would not be cost-effective.
For example, many controllers have an "SPI-style" port which behaves essentially as follows: when a byte is written to a certain register, the hardware will generate some number of clock pulses (typically eight), clocking out a data bit on the leading edge of each clock pulse and sampling an incoming data bit on the trailing edge. Generally, controllers' SPI-style ports will allow a variety of features to be configured, but in some cases it may be necessary to interface a processor with a device which does something unusual. A device may require that data bits be processed in multiples other than eight, or it may require that data be both output and sampled on the same clock edge, or it may have some other unusual requirement. If the particular hardware on the controller one is using can support one's precise requirements, great (some provide configurable numbers of bits, separately configurable transmit and receive timing, etc.). If not, bit-banging may be helpful. Depending upon the controller, bit-banging an SPI-ish interface will often take 2-10 times as long as letting hardware handle it, but if the requirements don't fit with the hardware one has, exchanging data more slowly may be better than not being able to do it at all.
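One of the points above — widths other than eight bits — is trivial once the transfer is done in software. A minimal sketch, with hypothetical pin functions and MISO simply looped back to MOSI as a stand-in for whatever unusual device is attached:

```c
#include <stdint.h>

/* Hypothetical pin layer; MISO is looped back to MOSI for the demo. */
static int mosi_level;
static void set_sck(int level)  { (void)level; }
static void set_mosi(int level) { mosi_level = level; }
static int  read_miso(void)     { return mosi_level; }

/* A transfer of any width up to 16 bits, MSB first -- the kind of
   flexibility a fixed-function SPI port may not offer. */
static uint16_t spi_transfer_n(uint16_t out, int nbits)
{
    uint16_t in = 0;
    for (int i = 0; i < nbits; i++) {
        set_sck(0);
        set_mosi((out >> (nbits - 1)) & 1);       /* present next bit */
        set_sck(1);
        in = (uint16_t)((in << 1) | read_miso()); /* sample           */
        out = (uint16_t)(out << 1);
    }
    set_sck(0);
    return in;
}
```

With the loopback in place, a 9-bit or 12-bit word comes back unchanged, confirming the shifting logic works for non-byte widths.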
One important thing to note with bit-banged designs is that they are simplest and most robust in circumstances where either the devices being communicated with are waiting on the bit-banging controller to generate all their timing, or where the controller will be allowed to wait, without distraction, for an event to arrive, and where it will be able to do everything it needs to do with that event before any other event arrives that it needs to act upon. They are much less robust in circumstances where a device will need to be able to react to external stimuli within a relatively short time frame, but cannot devote 100% of its energy to watching for such stimuli.
For example, suppose one wishes to have a processor transmit UART-style data serially at a rate which is very high relative to its clock speed (e.g. a PIC which is running 8,192 instructions per second wishes to output data at 1200 bps). If no interrupts are enabled, such transmission is not difficult (clock one bit every seven instruction cycles). If a PIC were doing nothing but waiting for an incoming 1200 bps data byte, it could execute a 3-cycle loop waiting for the start bit, and then proceed to clock in the data at seven-cycle intervals. Indeed, if a PIC had a byte of data ready to send when an incoming byte of data arrived, seven cycles per bit would be enough time for the PIC to send its byte of data simultaneously with reading the incoming byte. Likewise, a PIC which initiated a 1200 bps transmission would be able to look to see if the device it's communicating with was sending back a reply, if such a reply would have fixed timing relative to the original transmission. On the other hand, there would be no way for PICs at that speed to handle bit-bang communications in such a way that either device was allowed to transmit at any time it saw fit (as opposed to having one device which could transmit when it saw fit, and do whatever it liked when not transmitting, and one device which would have to spend most of its time doing nothing but waiting for transmissions from the first device).
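What the transmitter in that example actually has to emit is just a fixed sequence of line levels played out at seven-cycle intervals. A sketch, assuming the common 8N1 framing (one start bit, eight data bits LSB first, one stop bit — an assumption, since the text doesn't specify the frame format):

```c
#include <stdint.h>

#define FRAME_BITS 10  /* start + 8 data + stop, for 8N1 framing */

/* Fill 'levels' with the line states for one 8N1 UART frame. A bit-banged
   transmitter would drive a pin through this table at fixed bit intervals
   (e.g. every seven instruction cycles at 8,192 instructions/s, 1200 bps). */
static void uart_frame(uint8_t byte, int levels[FRAME_BITS])
{
    levels[0] = 0;                        /* start bit: line low  */
    for (int i = 0; i < 8; i++)
        levels[1 + i] = (byte >> i) & 1;  /* data bits, LSB first */
    levels[9] = 1;                        /* stop bit: line high  */
}
```

The receive side is the mirror image: spin until the line drops for the start bit, then sample at the same fixed intervals — which is exactly why the PIC in the example must give the line its undivided attention.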