Fast algorithms for calculating the distance between measures on finite ultrametric spaces
The standard way to quickly approximate Wasserstein distances is to use entropic regularization. Gabriel Peyre and Marco Cuturi wrote a good book on this topic which is available on the Arxiv at https://arxiv.org/abs/1803.00567 (or on Peyre's website). The relevant part is Chapter 4.
However, I'm not sure if there is an extra gain from considering an ultrametric space.
This is a rather more fun problem than I thought. I must apologize though, as your question is a reference request and I have no references apart from pointing at any textbook on discrete optimization. It turns out, the key is that one can rewrite your problem into a flow problem on a tree, which then is almost trivial to solve. Thus, if I am not mistaken, not only is your upper bound $\hat{\rho}$ the correct value for $\rho$, but the same is true for many other heuristic ways to construct an upper bound. The ultrametric seems to try its best to actively prevent you from accidentally choosing bad solutions and you can use this to define some algorithms which should be almost optimal.
Preliminaries
I think the problem is easier to understand in the transport formulation (which is the dual of the one used in the question): $$ \rho(\mu,\eta) := \min \left\{ \int_{X \times X} d(x,y) \,dT : T \in P(X\times X), T(.,X) = \mu,T(X,.)=\eta\right\} $$ i.e. $T(A,B)$ tells us how much mass is transported from $A$ to $B$. I will mostly use this and some derived formulation, but it is good to have both around. In particular, if you have an $f$ for the formulation in the question and a $T$ for this formulation which both give you the same value, you know that both have to be optimal.Futhermore, we can assume that $\operatorname{supp} \mu \cap \operatorname{supp} \eta = \emptyset$, as transporting from a point to itself is free. In fact, I will not assume that $\mu$ and $\eta$ are probability measures but only that $\mu(X) = \eta(X)$, which works equally well with all definitions and allows us to easily substract similar amounts from both without having to renormalize in every step. In fact in this context it can be useful to consider the signed measure $\nu = \mu -\eta$ instead, which sufficiently describes both of them.
The tree problem
As far as I can gather, any ultrametric can be written in form of a tree (rooted, as used in computer science), where the leaves correspond to the points of $X$ and each subtree to a set of balls containing precisely the points that are its leaves. One can then assign a distance $d_e$ to each edge $e \in E$ of the tree such that the distance between two points in $X$ corresponds to the length of their connecting path through the graph.
One can rewrite finding the WKR-metric into a flow problem on the tree: Extend $\mu$ to the interior nodes by $0$. Now we need to find a flow, i.e. an assignment of a direction and a value $p_e$ to each edge (It is simpler to assume a fixed direction, say upward in the tree and a signed $p_e$ instead) such that in each node $n$ the total of in and outgoing flow corresponds $\nu(n)$. The cost of such a flow then is given by $\sum_e d_e |p_e|$.
The interesting fact about this problem is that on a tree, such a flow is always unique. Also the cost of the unique flow is identical to the WKR-metric. In fact you can recover an $f$ with identical resulting value by assigning a fixed value to a given node $v$ and the recursively setting $f(w) = f(v) \pm d_{(v,w)}$ for all its neighbours, where the sign depends on the direction of the flow. Similarly, you can recover a $T$ by splitting the flow into a sum of weighted paths between leaves and setting $T(\{(x,y)\})$ to the weight of that path. If you take care to never have any cancellation (which is always possible), the corresponding value will again be the same as the cost of the flow.
A fast algorithm given a tree
There are fast algorithms to calculate an optimal flow in graphs, but as we only require the cost of the flow, there is an easy recursive algorithm to calculate it along the tree. For each subtree, we simultaneously construct the internal cost of the flow the flow that leads upwards from it. The total cost then is the internal cost of the whole tree.
For each leaf $x$, the internal cost is 0 and the flow upwards is $\nu(x)$.
For each subtree, we can recursively calculate internal cost and flow upwards of all of its child trees. The internal cost of the subtree then is the sum of internal costs of its child trees plus the sum of the absolute values of the flows from each of those children multiplied by each respective distance. The flow upwards is simply the sum of all signed flows from the children.
This algorithm only visits each node in the tree once and does a rather simple calculation there, so I'd argue that it is next to optimal. In particular as there are always more children than internal nodes in a tree, it is of order $O(|X|)$. I also believe it is equivalent to the heuristic in the question.
A fast algorithm without a tree
If we do not have the tree structure but are instead only given the distance function, we do not need to calculate the tree. Instead there is a faster way to get to the same value by a simple greedy algorithm:
- Find the pair of nodes $x,y$ with $\mu(\{x\}) > 0$ and $\eta(\{y\}) > 0$ such that $d(x,y)$ is minimal.
- Add $d(x,y)\min(\mu(\{x\}),\eta(\{y\}))$ to the total cost and reduce $\mu(\{x\})$ and $\eta(\{y\})$ by $\min(\mu(\{x\}),\eta(\{y\}))$
- Repeat until $\mu=\eta =0$
If initially one creates a binary heap of all distances this needs a runtime of order $O(|X|^2\log |X|)$. Then in each iteration this algorithm reduces $\operatorname{supp} \mu$ or $\operatorname{supp} \eta$ by a point, so it will run at most for $|X|$ iterations and in doing so remove all elements from the heap again in runtime $O(|X|^2\log |X|)$. As there are a potential $O(|X|^2)$ of distance values to check I'd argue that this again is close to optimal.
The reason why this algorithm returns the right result is evident if one considers the graph in parallel. In each iteration you can add the path between $x$ and $y$ with weight $\min(\mu(\{x\}),\eta(\{y\}))$. When the algorithm finishes, the sum of those paths then gives the flow and one can show that no cancellation occurs. The idea is that the tree is kind of filled from the bottom and a path of minimal distance starting can only ever leave a subtree, if either $\mu$ or $\eta$ is already zero on this subtree, so there will be no future path coming in the opposite direction.
Other distances
A fun observation I had while writing this: At least with Wasserstein-distances, one is generally interested in $d(x,y)^p$ for some $p \in [1,\infty)$ as a cost instead of just $d(x,y)$. But if $d$ is an ultrametric, so is $d^p$, so the whole argument gets adapted easily.