Generalizations of the Birkhoff-von Neumann Theorem
I am cheating a little to give this answer, because I am fairly sure that it is part of Gil's motivation in asking the question. The most natural generalization of the Birkhoff hypothesis to quantum probability is only true for qubits. (It might also be true for a qubit tensor a classical system; I did not check that case.)
A quantum measurable space is a von Neumann algebra. We are most interested in the finite-dimensional case, where classically "measurable space" is just a fancy name for the random variables on a finite set. A finite-dimensional von Neumann algebra is a direct sum of matrix algebras. In particular, $M_2$ is called a qubit and $M_d$ is called a qudit.
To make a long story short, the Birkhoff hypothesis can be stated for a direct sum of $a$ copies of $M_b$, or $aM_b$. In this setting, a doubly stochastic map $E$ is a linear map from $aM_b$ to itself that preserves trace, that preserves the identity element, and that is completely positive. In this setting, $E$ is completely positive if it takes positive semidefinite elements of $aM_b$ to positive semidefinite elements, and if $E \otimes I$ also has that property on the algebra $aM_b \otimes N$ for another von Neumann or $C^*$-algebra $N$. The natural analogue of permutation matrices are the *-algebra automorphisms of $aM_b$. These are permutations of the matrix blocks, composed with maps of the form $E(x) = uxu^*$, where $u$ is a unitary element of $aM_b$. The question as before is whether the doubly stochastic maps are the convex hull of the automorphisms.
This Birkhoff hypothesis is true for $M_2$, false for $M_d$ for $d \ge 3$, and I should check it for $nM_2$. It is true for $aM_1 = a\mathbb{C}$, because then it is the usual Birkhoff-von Neumann theorem.
I am left wondering about two infinite classical versions of Birkhoff's theorem, for the algebras $\ell^\infty(\mathbb{N})$ and $L^\infty([0,1])$. In the former case, one would ask whether any stochastic map that preserves counting measure (even though counting measure is not normalized) is an infinite convex sum of permutations of $\mathbb{N}$. In the latter case, whether any stochastic map that preserve Lebesgue measure is a convex integral of measure-preserving permutations of $[0,1]$. Addendum: At least the discrete infinite case is addressed, with generally positive results, in this review and in this older review. The older paper also raises the continuous question but with no results. However, with some more Googling I found this counterexample paper.
Since Gil asks for a reference, a recent one is Unital Quantum Channels - Convex Structure and Revivals of Birkhoff's Theorem, by Mendl and Wolf.
Here also is a more orthodox combinatorial generalization of the Birkhoff theorem, and also another case that I once encountered that is between a generalization and a non-generalization. Since Gil now offers a bounty, maybe it's better to merge this answer with the other one.
A doubly stochastic matrix can be interpreted as a flow through a directed graph, with unit capacities. (See Unimodular matrix in Wikipedia; I learned about this long ago from Jesus de Loera.) Any such graph has a polytope of flows, called a network flow polytope. Any network flow polytope has integer vertices, because it is a totally unimodular polytope.
A totally unimodular polytope is a polytope whose facets have integer equations, and with the property that any maximal, linearly independent collection of facets intersects in an integer point because their matrix has determinant $\pm 1$. In particular the vertices are such intersections, so the vertices are all integral. This is a vast generalization of Birkhoff's theorem that comes from generalizing one of the proofs of Birkhoff's theorem.
Example: An alternating-sign matrix is equivalent to a square ice orientation of a square grid. The square ice orientations can be defined by a network flow, so you obtain an alternating-sign-matrix polytope. The generalized Birkhoff theorem in this case says that every vertex of the polytope is an alternating-sign matrix, in fact that every integer point of the $n$-dilated polytope is a sum of $n$ alternating-sign matrices.
The other case that I encountered was the polytope of fractional perfect matchings of a non-bipartite set with $2n$ elements. By contrast, the Birkhoff polytope is the case of a bipartite set with $n$ elements of each type. By definition, it is the polytope of non-negative weights assigned to the edges of the complete graph on $2n$ vertices, such that the total weight at each vertex is 1. Strictly speaking, the Birkhoff theorem is false; not every vertex is a perfect matching. Instead, all of the vertices are combinations of matched pairs, and odd cycles with weight $\frac12$.
At first glance this looks like bad news for the application of computing a perfect matching or the optimum perfect matching of a graph. Indeed, if instead you take the convex hull of the perfect matchings, the result is a polytope with exponentially many facets. However, a good algorithm exists anyway; there is a version of the simplex algorithm that only ever uses polynomially many of the facets.
I have found a paper on a generalization of the Birkhoff-von Neumann theorem here:
http://cowles.econ.yale.edu/conferences/2009/sum-09/theory/che.pdf
The authors are Eric Budish, Yeon-Koo Che, Fuhito Kojima, and Paul Milgram.
Here is the Abstract:
The Birkhoff-von Neumann Theorem shows that any bistochastic matrix can be written as a convex combination of permutation matrices. In particular, in a setting where n objects must be assigned to n agents, one object per agent, any random assignment matrix can be resolved into a deterministic assignment in accordance with the specified probability matrix. We generalize the theorem to accommodate a complex set of constraints encountered in many real-life market design problems. Specifically, the theorem can be extended to any environment in which the set of constraints can be partitioned into two hierarchies. Further, we show that this bihierarchy structure constitutes a maximal domain for the theorem, and we provide a constructive algorithm for implementing a random assignment under bihierarchical constraints. We provide several applications, including (i) single-unit random assignment, such as school choice; (ii) multi-unit random assignment, such as course allocation and fair division; and (iii) two- sided matching problems, such as the scheduling of inter-league sports matchups. The same method also finds applications beyond economics, generalizing previous results on the minimize makespan problem in the computer science literature
I have also found a master's thesis that involved generalization from a matrix to a hyper matrix, a matrix in higher dimensions. So one example would be a cubic array of numbers instead of a square. He proves a generalization to the three dimensional matrices which are called blocks. There are some open questions there as well. I found it interesting as I have wondered about extending the two dimensions of matrices to three coordinates and seeing what happened. It is available here:
https://ritdml.rit.edu/dspace/bitstream/1850/5967/1/NReffThesis05-18-2007.pdf
There is a relevant paper by Gromova regarding high dimensional matrices:
Gromova, M. B., The Birkhoff-von Neumann theorem for polystochastic matrices. (Russian) Operations research and statistical simulation, No. 2 (Russian), pp. 3–15, 149. Izdat. Leningrad. Univ., Leningrad, 1974. It was translated to English in the early 90s.
The MR review by by George P. Barker reads: In Section 1 the author gives the basic definitions and introduces a notion of the spectrum of a multidimensional matrix as a vector. The basic theorem which gives necessary and sufficient conditions for a vector to be the spectrum of an extremal matrix is then formulated. The necessary condition of the basic theorem consists of a direct generalization of the Birkhoff-von Neumann theorem.
A related paper is: Brualdi, R. A.; Csima, J. Extremal plane stochastic matrices of dimension three. Linear Algebra and Appl. 11 (1975), no. 2, 105–133.
and an older paper with some relevance is: Jurkat, W. B.; Ryser, H. J. Extremal configurations and decomposition theorems. I. J. Algebra 8 1968 194–222.