Where exactly is the USB spec that explains what to do when the cable is first connected?
USB has several layers, which are described in the USB 2.0 Specification. If you're familiar with the OSI layered network model, you can think of it like this:
- Session layer = Chapter 10 USB Host Hardware and Software (device drivers)
- Transport layer = Chapter 9 USB Device Framework
- Network layer = Chapter 8 Protocol Layer (bitstream)
- Data Link layer = Chapter 7 Electrical (circuit)
- Physical layer = Chapter 6 Mechanical (cable and connector)
Conceptually USB is based on streams of data, called Endpoints, which can be either IN (to the host) or OUT (from the host). Every device has Endpoint 0, which is used for control and status. A device may have additional endpoints for application data. Each endpoint behaves like a FIFO buffer.
Data is transferred on an endpoint either as Bulk (like TCP/IP, guaranteed that every byte arrives and in the correct order), or Isochronous (like UDP/IP, delivered on schedule but packets may be dropped). There is a misleadingly named "Interrupt" transfer type, which is really just polled by the host. Endpoint 0 uses a fourth type, Control.
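To make the endpoint direction and transfer type concrete, here is a minimal sketch that decodes a standard 7-byte endpoint descriptor (USB 2.0, Table 9-13). The function name and the example bytes are made up for illustration; they do not come from any particular device.

```python
import struct

# bmAttributes bits 1..0 select the transfer type.
TRANSFER_TYPES = {0: "Control", 1: "Isochronous", 2: "Bulk", 3: "Interrupt"}


def parse_endpoint_descriptor(raw: bytes) -> dict:
    """Decode a 7-byte standard endpoint descriptor (hypothetical helper)."""
    if len(raw) < 7:
        raise ValueError("endpoint descriptor needs at least 7 bytes")
    b_length, b_type, address, attributes, max_packet, interval = struct.unpack(
        "<BBBBHB", raw[:7]
    )
    if b_length < 7 or b_type != 0x05:  # 0x05 = ENDPOINT descriptor type
        raise ValueError("not an endpoint descriptor")
    return {
        "number": address & 0x0F,                        # bits 3..0
        "direction": "IN" if address & 0x80 else "OUT",  # bit 7
        "type": TRANSFER_TYPES[attributes & 0x03],
        "max_packet_size": max_packet & 0x07FF,          # bits 10..0
        "interval": interval,
    }


# Illustrative bytes only: endpoint 1, IN, Bulk, 512-byte packets.
example = bytes([0x07, 0x05, 0x81, 0x02, 0x00, 0x02, 0x00])
info = parse_endpoint_descriptor(example)
print(info)
```

Note that IN and OUT are always named from the host's point of view, which is why bit 7 set means data flows to the host.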
USB 2.0 uses a differential pair for the data link. I won't go into much detail, as this is covered by chapter 7 of the USB 2.0 spec. On the PCB layout we generally treat D+ and D- as a matched-length differential pair, and put in the series resistors required by whatever USB PHY (physical-layer interface) is being used. A USB peripheral uses a pull-up resistor (nominally 1.5 kΩ) on either the D+ or the D- line to tell the host whether it is a full-speed or a low-speed peripheral. A high-speed device first attaches as full-speed and negotiates high speed during reset.
Soon after the USB host discovers that a device is present, the host requests a number of descriptors from the device. This is taken care of behind the scenes by the FTDI chip. The descriptors are described in Section 9.5. They include the Device Descriptor, Configuration Descriptor, Interface Descriptors, Endpoint Descriptors, String Descriptors, and possibly HID Report Descriptors.
The Device Descriptor includes the USB VID (Vendor Identification) and PID (Product Identification) numbers. The operating system uses this pair of numbers, VID_PID, to determine which device driver shall be used for this device. Note that VID numbers are issued by the USB Implementers Forum (USB-IF), typically through membership, so that's kind of a problem if you're an individual inventor.
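As a rough illustration of where the VID and PID sit, the sketch below unpacks an 18-byte device descriptor (the layout of Table 9-8) and formats the pair the way Windows-style hardware IDs are commonly written. The descriptor bytes are invented for the example; 0x0403 is FTDI's vendor ID, and the product ID shown is only an assumed example value.

```python
import struct
from typing import NamedTuple


class DeviceDescriptor(NamedTuple):
    bLength: int
    bDescriptorType: int
    bcdUSB: int
    bDeviceClass: int
    bDeviceSubClass: int
    bDeviceProtocol: int
    bMaxPacketSize0: int
    idVendor: int
    idProduct: int
    bcdDevice: int
    iManufacturer: int
    iProduct: int
    iSerialNumber: int
    bNumConfigurations: int


def parse_device_descriptor(raw: bytes) -> DeviceDescriptor:
    """Unpack the 18-byte standard device descriptor (little-endian fields)."""
    if len(raw) < 18:
        raise ValueError("device descriptor needs 18 bytes")
    desc = DeviceDescriptor._make(struct.unpack("<BBHBBBBHHHBBBB", raw[:18]))
    if desc.bDescriptorType != 0x01:  # 0x01 = DEVICE descriptor type
        raise ValueError("not a device descriptor")
    return desc


def hardware_id(desc: DeviceDescriptor) -> str:
    """Format VID/PID as a Windows-style hardware ID string (hypothetical helper)."""
    return f"USB\\VID_{desc.idVendor:04X}&PID_{desc.idProduct:04X}"


# Invented example bytes: USB 2.0, class defined per interface, EP0 size 8.
raw = bytes.fromhex("12010002000000080304156000100102" "0301")
desc = parse_device_descriptor(raw)
print(hardware_id(desc))
```

The operating system performs a lookup on a string like this one (plus class codes) to pick a driver; the exact matching rules are OS-specific and not shown here.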
Additionally, there is the HID (Human Interface Device) class driver, which provides fairly generic input for keyboards, mice, and similar devices, as well as generic input/output. One advantage of HID is that it does not require a custom device driver, but its throughput is limited compared to a custom bulk driver. There is a separate specification document for the HID descriptors, and a HID Usage Tables document that lists all of the code numbers describing the features available on a given human interface device.
An FTDI chip such as the FT220X (see its datasheet) provides the USB "serial interface engine" (not to be confused with SPI serial or RS232 serial). This takes care of most of the low-level work described in chapters 6, 7, and 8.
FTDI uses an EEPROM (off-chip on the FT2232H, on-chip on the FT220X) to hold a small part of the information that goes into the descriptors. You can customize the VID/PID values and provide custom description strings.
The behavior and interaction of USB "partners" (a host and a device) is scattered across the USB specification. The best way to get your bearings is to look at the "device framework" in Chapter 9, which describes the mandated device states (Figure 9-1), and at the Host and Hub frameworks in Chapters 10 and 11. Setting aside protocol details (pipes, transaction types, abstract OSI protocol layers, PCB layout, etc.), you can get a better grip on the initial interaction by studying the port state diagram (Figure 11-10).
In essence, if no cable is connected between host and device, the host ports are in the "Powered" state (VBUS is ON) but "Disconnected". The D+ and D- wires are held low by 15k pull-downs.
When the cable is connected, VBUS reaches the device. The device recognizes that it is being connected and signals a "connect" event by pulling one of the D wires HIGH: D+ if it is an FS/HS device, D- if it is an LS device.
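The attach detection described above boils down to a small truth table. The following is a toy model only, with a made-up function name; a real host samples the line state in hardware and debounces it before acting.

```python
def classify_attach(d_plus_high: bool, d_minus_high: bool) -> str:
    """Classify the idle line state seen by a host port (toy model).

    The host's 15k pull-downs hold both lines low until a device's
    pull-up raises one of them.
    """
    if not d_plus_high and not d_minus_high:
        return "disconnected"   # both lines low: nothing attached
    if d_plus_high and not d_minus_high:
        return "full-speed"     # HS devices also attach this way at first
    if d_minus_high and not d_plus_high:
        return "low-speed"
    return "invalid"            # both lines high is not a valid idle state


for lines in [(False, False), (True, False), (False, True)]:
    print(lines, "->", classify_attach(*lines))
```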
A pull-up on the D+ or D- wire of a port raises an interrupt to the host software, reporting a "port status change". The host software (usually ehci.sys) then initiates the "port reset" sequence on that particular port. Upon successful completion of the USB port reset, the host port is enabled for USB communication. The port becomes active (start-of-frame packets start flowing out).
Using the USB protocol, the host assigns a unique address to the device and reads the "device descriptor". This starts the "device enumeration" process. The device descriptor contains information about which device class the device belongs to (HID, COM, MIDI, Printer, etc.) and the VID/PID of that particular device, plus a number of other fields; see Table 9-8.
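Both steps are standard control requests on Endpoint 0, each carried in an 8-byte SETUP packet (Section 9.3). As a hedged sketch, the helpers below only build the packet bytes; the helper names are hypothetical, and actually sending the packets is the job of the host controller driver.

```python
import struct

# Standard request codes and descriptor type from USB 2.0 Chapter 9.
SET_ADDRESS = 0x05
GET_DESCRIPTOR = 0x06
DESCRIPTOR_TYPE_DEVICE = 0x01


def setup_packet(bm_request_type: int, b_request: int,
                 w_value: int, w_index: int, w_length: int) -> bytes:
    """Pack the five SETUP fields into 8 little-endian bytes."""
    return struct.pack("<BBHHH", bm_request_type, b_request,
                       w_value, w_index, w_length)


def set_address(address: int) -> bytes:
    """Host-to-device standard request that assigns a bus address."""
    if not 1 <= address <= 127:
        raise ValueError("USB device addresses are 1..127")
    return setup_packet(0x00, SET_ADDRESS, address, 0, 0)


def get_descriptor(desc_type: int, index: int = 0, length: int = 18) -> bytes:
    """Device-to-host standard request; wValue = (type << 8) | index."""
    return setup_packet(0x80, GET_DESCRIPTOR, (desc_type << 8) | index, 0, length)


print(set_address(5).hex())
print(get_descriptor(DESCRIPTOR_TYPE_DEVICE).hex())
```

The default length of 18 matches the size of the device descriptor; other descriptor types need a different length.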
After getting the device class and VID/PID, the host software tries to match this information in the device registry and loads the corresponding DEVICE driver, either a generic one or a vendor-specific one (if it exists). The device driver then finishes the enumeration process by selecting a device interface and finally setting the "device configuration". Only the device behind this particular port responds to this communication, even though all packets are broadcast to all enabled ports.
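The sequence walked through so far (power, reset, address, configuration) follows the main path through the device state diagram in Figure 9-1. The sketch below models only that happy path; the event names are invented for illustration, and the Suspended state and the "reset from any state" arcs are deliberately left out.

```python
from enum import Enum


class DeviceState(Enum):
    ATTACHED = "Attached"
    POWERED = "Powered"
    DEFAULT = "Default"
    ADDRESS = "Address"
    CONFIGURED = "Configured"


# (current state, event) -> next state; event names are hypothetical labels.
TRANSITIONS = {
    (DeviceState.ATTACHED, "vbus_present"): DeviceState.POWERED,
    (DeviceState.POWERED, "bus_reset"): DeviceState.DEFAULT,
    (DeviceState.DEFAULT, "set_address"): DeviceState.ADDRESS,
    (DeviceState.ADDRESS, "set_configuration"): DeviceState.CONFIGURED,
}


def step(state: DeviceState, event: str) -> DeviceState:
    """Advance one transition, rejecting events that are out of order."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} not allowed in state {state.value}"
        ) from None


def enumerate_device(events) -> DeviceState:
    """Replay a list of events starting from the Attached state."""
    state = DeviceState.ATTACHED
    for event in events:
        state = step(state, event)
    return state


final = enumerate_device(
    ["vbus_present", "bus_reset", "set_address", "set_configuration"]
)
print(final.value)
```

Only in the Configured state may the device's non-zero endpoints be used, which is why application traffic (such as MIDI data) cannot start before this point.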
The above is the general framework of the USB connect protocol. Packetizing data for any particular purpose (like MIDI) is a different story; it is handled either at the application level or at the device driver level, if the system recognizes the proper device class. To get native MIDI communication, the device must declare this class in its descriptors and follow all MIDI class definitions.