What are the technical reasons behind the "Itanium fiasco", if any?
Itanium failed because VLIW for today's workloads is simply an awful idea.
Donald Knuth, a widely respected computer scientist, said in a 2008 interview that "the 'Itanium' approach [was] supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write." [1]
That pretty much nails the problem.
For scientific computation, where you get at least a few dozen instructions per basic block, VLIW probably works fine. There are enough instructions there to create good bundles. For more modern workloads, where you often get only about 6-7 instructions per basic block, it simply doesn't (that's the average, IIRC, for SPEC2000). The compiler simply can't find enough independent instructions to put in the bundles.
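To make the bundle-filling problem concrete, here is a minimal sketch of greedy scheduling inside a single basic block. It is a toy model, not how a real IA-64 compiler works: each bundle is treated as a group of three mutually independent instructions, and real Itanium details such as bundle templates and stop bits are ignored. The names `pack_bundles`, `slot_utilization`, and `short_block`, and the dependency graph itself, are made up for illustration.

```python
from typing import Dict, List, Set

BUNDLE_WIDTH = 3  # an Itanium bundle has 3 instruction slots


def pack_bundles(deps: Dict[str, Set[str]], width: int = BUNDLE_WIDTH) -> List[List[str]]:
    """Greedily pack one basic block into bundles.

    `deps` maps each instruction (in program order) to the set of
    instructions whose results it needs. Toy assumption: an instruction
    can only go into a bundle after the bundles holding all its producers.
    """
    done: Set[str] = set()
    remaining = list(deps)  # dicts preserve insertion (program) order
    bundles: List[List[str]] = []
    while remaining:
        # Everything whose producers are already scheduled, up to `width`.
        ready = [i for i in remaining if deps[i] <= done][:width]
        if not ready:
            raise ValueError("dependency cycle in basic block")
        bundles.append(ready)
        done.update(ready)
        remaining = [i for i in remaining if i not in ready]
    return bundles


def slot_utilization(bundles: List[List[str]], width: int = BUNDLE_WIDTH) -> float:
    """Fraction of bundle slots holding real work (the rest would be NOPs)."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# Hypothetical 6-instruction basic block with a short dependency chain.
short_block = {
    "i1": set(),          # load
    "i2": {"i1"},         # uses the load
    "i3": {"i2"},         # compare
    "i4": set(),          # independent load
    "i5": {"i2", "i4"},   # combines both values
    "i6": {"i3"},         # branch on the compare
}

bundles = pack_bundles(short_block)
print(bundles, slot_utilization(bundles))
```

In this made-up block the scheduler needs four bundles for six instructions, so half of the slots stay empty. With dozens of instructions per block there are far more candidates to pick from, which is the scientific-code case described above.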
Modern x86 processors, with the exception of Intel Atom (pre Silvermont) and I believe AMD E-3**/4**, are all out-of-order processors. They maintain a dynamic instruction window of roughly 100 instructions, and within that window they execute instructions whenever their inputs become ready. If multiple instructions are ready to go and they don't compete for resources, they go together in the same cycle.
So how is this different from VLIW? The first key difference between VLIW and out-of-order is that the out-of-order processor can choose instructions from different basic blocks to execute at the same time. Those instructions are executed speculatively anyway (based on branch prediction, primarily). The second key difference is that out-of-order processors determine these schedules dynamically (i.e., each dynamic instruction is scheduled independently; the VLIW compiler operates on static instructions).
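The two differences above can be sketched with a toy cycle-by-cycle issue model. This is only an illustration under simplifying assumptions: no register renaming, no branch mispredictions, a fixed issue width, and every producer named in the input actually appears in it. The in-order mode is a rough stand-in for a statically scheduled machine whose compiler could not hoist work across the branch. The function `simulate_issue`, the instruction names, and the latencies are all hypothetical.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, List[str], int]  # (name, producer names, latency in cycles)


def simulate_issue(window: List[Instr], width: int, in_order: bool) -> Dict[str, int]:
    """Return the cycle in which each instruction issues.

    Out-of-order: any instruction whose inputs are ready may issue.
    In-order: issue stops at the first instruction that is not ready.
    """
    finish: Dict[str, int] = {}  # cycle at which each result is available
    issued: Dict[str, int] = {}
    pending = list(window)       # program order
    cycle = 0
    while pending:
        slots = width
        blocked = False
        still_pending: List[Instr] = []
        for name, srcs, lat in pending:
            ready = all(s in finish and finish[s] <= cycle for s in srcs)
            if ready and slots > 0 and not blocked:
                issued[name] = cycle
                finish[name] = cycle + lat
                slots -= 1
            else:
                still_pending.append((name, srcs, lat))
                if in_order:
                    blocked = True  # younger instructions must wait
        pending = still_pending
        cycle += 1
    return issued


# Hypothetical window spanning two basic blocks: ld1 misses in cache
# (4 cycles), and ld2/add2 belong to the predicted block after the branch.
window = [
    ("ld1", [], 4),
    ("use1", ["ld1"], 1),
    ("br", ["use1"], 1),
    ("ld2", [], 2),
    ("add2", ["ld2"], 1),
]

ooo = simulate_issue(window, width=2, in_order=False)
ino = simulate_issue(window, width=2, in_order=True)
print("out-of-order:", ooo)
print("in-order:    ", ino)
```

In this example the out-of-order model starts `ld2` in cycle 0, underneath the cache miss and before the branch resolves, while the in-order model cannot start it until cycle 5. The miss latency is only known at run time, which is why a static schedule has trouble hiding it.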
The third key difference is that implementations of out-of-order processors can be as wide as wanted, without changing the instruction set (Intel Core has 5 execution ports, other processors have 4, etc). VLIW machines can and do execute multiple bundles at once (if they don't conflict). For example, early Itanium CPUs execute up to 2 VLIW bundles per clock cycle, 6 instructions, with later designs (2011's Poulson and later) running up to 4 bundles = 12 instructions per clock, with SMT to take those instructions from multiple threads. In that respect, real Itanium hardware is like a traditional in-order superscalar design (like P5 Pentium or Atom), but with more / better ways for the compiler to expose instruction-level parallelism to the hardware (in theory, if it can find enough, which is the problem).
Performance-wise, with similar specs (caches, cores, etc.), out-of-order x86 processors just beat the crap out of Itanium.
So why would one buy an Itanium now? Well, the only reason really is HP-UX. If you want to run HP-UX, that's the way to do it...
Many compiler writers don't see it this way - they always liked the fact that Itanium gives them more to do, puts them back in control, etc. But they won't admit how miserably it failed.
Footnote 1:
This was part of a response about the value of multi-core processors. Knuth was saying parallel processing is hard to take advantage of; finding and exposing fine-grained instruction-level parallelism (and explicit speculation: EPIC) at compile time for a VLIW is also a hard problem, and somewhat related to finding coarse-grained parallelism to split a sequential program or function into multiple threads to automatically take advantage of multiple cores.
11 years later he's still basically right: per-thread performance is still very important for most non-server software, and something that CPU vendors focus on because many cores is no substitute.
Itanium's design rested on the philosophy of using very wide instruction-level parallelism to scale processor performance once thermal constraints impose a limit on clock frequency.
But AMD Opteron disrupted Itanium adoption by proliferating x86_64 cores to achieve scalable performance while also staying compatible with 32-bit x86 binaries.
Itanium servers are 10x more expensive than x86 servers with a similar processor count.
All of the above factors slowed adoption of Itanium servers in the mainstream market. Itanium's main market now is mission-critical enterprise computing, which is a good $10B+/year market dominated only by HP, IBM and Sun.
Simple. It wasn't x86 compatible. That's why x86_64 chips are.