How to evaluate a Word2Vec model
There's no generic way to assess token-vector quality, especially when you're not even using real words, against which standard tasks (like the popular analogy-solving benchmarks) could be tried.
If you have a custom ultimate task, you have to devise your own repeatable scoring method. That method will likely either be a subset of your actual final task, or something well-correlated with it. Essentially, whatever ad-hoc method you may be using to 'eyeball' the results for sanity should be systematized, saving your judgements from each evaluation so they can be re-run against iterative model improvements (see the sketch below).
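As a rough illustration, here is a minimal sketch of what such a systematized check might look like with gensim: you save the neighbours you'd expect to see when eyeballing, then score each candidate model by how many of them actually appear in the top-N results. The probe words and expected neighbours below are made-up placeholders you'd replace with your own saved judgements.

```python
from gensim.models import Word2Vec

# Hypothetical probe set: for each probe word, the neighbours you would
# expect to see when "eyeballing" the results. Save this once and reuse
# it against every model iteration.
PROBES = {
    "refund": {"reimbursement", "chargeback"},
    "shipping": {"delivery", "postage"},
}

def probe_score(model: Word2Vec, topn: int = 10) -> float:
    """Fraction of expected neighbours found in the top-N most-similar lists."""
    hits, total = 0, 0
    for word, expected in PROBES.items():
        if word not in model.wv:
            continue
        neighbours = {w for w, _ in model.wv.most_similar(word, topn=topn)}
        hits += len(expected & neighbours)
        total += len(expected)
    return hits / total if total else 0.0

# Example usage: compare two candidate models trained with different settings.
# model_a = Word2Vec.load("model_a.model")
# model_b = Word2Vec.load("model_b.model")
# print(probe_score(model_a), probe_score(model_b))
```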
(I'd need more info about your data/items and ultimate goals to make further suggestions.)
One way to evaluate a word2vec model is to develop a "ground truth" set of words: pairs that should ideally be closest together in vector space. For example, if your corpus is related to customer service, the vectors for "dissatisfied" and "disappointed" should ideally have the smallest Euclidean distance or the largest cosine similarity.
You create this ground-truth table; maybe it has 200 word pairs, the most important pairs for your industry / topic. To assess which word2vec model is best, simply calculate the distance for each of the 200 pairs, sum up the total distance, and the model with the smallest total distance is your best model.
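For instance, a minimal sketch of that scoring loop with gensim, using the cosine-similarity variant (so the largest total wins); the pairs and model names below are hypothetical placeholders for your 200-pair table.

```python
from gensim.models import Word2Vec

# Hypothetical ground-truth table: pairs that should sit close together
# in vector space for your domain (in practice ~200 pairs).
GROUND_TRUTH_PAIRS = [
    ("dissatisfied", "disappointed"),
    ("refund", "reimbursement"),
    ("broken", "defective"),
]

def total_similarity(model: Word2Vec) -> float:
    """Sum cosine similarity over all ground-truth pairs (higher is better)."""
    score = 0.0
    for a, b in GROUND_TRUTH_PAIRS:
        if a in model.wv and b in model.wv:
            score += model.wv.similarity(a, b)  # cosine similarity of the pair
    return score

# Pick the candidate model with the largest summed similarity.
# candidates = {"cbow": Word2Vec.load("cbow.model"),
#               "skipgram": Word2Vec.load("skipgram.model")}
# best = max(candidates, key=lambda name: total_similarity(candidates[name]))
```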
I like this way better than the "eye-ball" method, whatever that means.