FPGA - synchronise “very close” clock from signal
No two clocks will ever perfectly match. The method of determining the true clock frequency from the data is called "clock recovery".
If you know the nominal bit rate, then one straightforward method that doesn't require a PLL/DCM block is to oversample the data and look for edges. Normally you need to oversample at 4X the bit rate or more. Here is how it works...
Create a clock in your part that is 4X the bit rate. In the case of a 65MHz bit rate this is a 260MHz clock.
Using the 260 MHz clock, double or triple register the incoming bits to avoid metastability issues. These can occur if an input signal changes very close to a clock edge, which is almost guaranteed to happen when the sampling clock is not the clock that generated the data.
Optionally do an extra two register stages and do a 2 out of 3 majority vote on the last three stages. This will reduce false detection of edges due to noise, which becomes important in the next step since you are using edges in the data to find the clock rate.
Make a two bit free running counter that counts from 0 to 3 and then rolls over back to 0. The counter is clocked by the 260MHz clock.
Whenever you see a 0 to 1 or 1 to 0 transition on the input data then assume you are at a clock edge and reset the counter to 1 (cnt <= "01").
Whenever the counter has a value of 2 (cnt = "10") use the output of your majority vote as your input sample. And if you are keeping a pixel count, also increment that.
I have personally used the above method to successfully recover the clock on serial data up to 100Mbps.
Depending on whether the incoming data is slightly faster or slower than your clock the counter will skip one tick or hold an extra tick to adjust the count rate to match the data.
For a slower data rate you will see something like...
...0,1,2,3,0,1,2,3,0,1,1,2,3,0,1,2,3,0,1,2,3...
For a faster data rate you will see something like...
...0,1,2,3,0,1,2,3,1,2,3,0,1,2,3,0,1,2,3...
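As a rough illustration, the counter-based steps above can be modelled in software. This is only a behavioural sketch in Python, not HDL: the names `majority3` and `recover_bits` are made up for the example, each loop iteration stands for one tick of the fast (4X) clock, and metastability itself is not modelled, only the register delay and the 2 out of 3 vote.

```python
def majority3(a, b, c):
    """2 out of 3 majority vote on three single-bit values."""
    return (a & b) | (a & c) | (b & c)


def recover_bits(samples):
    """Behavioural model of the counter-based recovery (hypothetical helper).

    samples: sequence of 0/1 values taken at roughly 4X the bit rate.
    Returns (bits, trace): the recovered bits and the counter value per tick.
    """
    taps = [0, 0, 0]  # last three register stages feeding the vote
    prev = 0          # voted value from the previous tick, for edge detection
    cnt = 0           # two-bit counter, 0..3
    bits, trace = [], []
    for s in samples:
        taps = [s, taps[0], taps[1]]   # shift the new sample in
        voted = majority3(*taps)
        if voted != prev:
            cnt = 1                    # edge seen: resynchronise the counter
        else:
            cnt = (cnt + 1) & 3        # otherwise free-run and roll over
        if cnt == 2:
            bits.append(voted)         # take the sample near mid-bit
        prev = voted
        trace.append(cnt)
    return bits, trace


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    samples = [b for b in data for _ in range(4)]  # ideal 4X oversampling
    bits, trace = recover_bits(samples)
    print(bits)
    print(trace)
```

Feeding the model a bit that lasts five fast-clock ticks instead of four makes the counter hold at 1 for an extra tick, which is the "slower data rate" pattern shown above. A three-tick bit makes it jump from 3 straight to 1.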
There is another method where you can do the 4X over-sampling using two clocks that are the same rate as your pixel clock but 90 degrees out of phase. By sampling into four registers (one on rising and one on falling for each clock) you can achieve the same effect as is done with the counter based setup above. The maximum possible pixel rates are higher for that method, but the logic is a little more complex.
It would be useful to think a bit about how a video card actually produces its output signals.
See for example an XFree86 modeline. Here's one for 1024x768 @ 60 Hz (non-interlaced) from an online generator tool. The details are implementation specific, but the idea holds across essentially all computers and video modes.
Dot Clock Frequency: 60.80 MHz
Modeline "1024x768" 60.80 1024 1056 1128 1272 768 768 770 796
You have a pixel clock, a region of active video, a sync signal defined as its start and end, and a total number of clocks in a line period, in this example 1272. All of these are expressed in units of pixel clocks, which is to say that everything traces back to the pixel clock, and does so in a digitally consistent way.
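To make the "everything traces back to the pixel clock" point concrete, here is a small Python sketch that pulls the numbers out of a modeline string and derives the line and frame rates from them. The function name `parse_modeline` and the dictionary keys are invented for this example, and it assumes the plain eleven-field form shown above with no trailing flags.

```python
import shlex


def parse_modeline(line):
    """Split a modeline into its timing numbers (hypothetical helper)."""
    fields = shlex.split(line)  # shlex strips the quotes around the mode name
    if len(fields) < 11 or fields[0].lower() != "modeline":
        raise ValueError("expected: Modeline \"name\" clock 4 x h-values 4 x v-values")
    clock_hz = float(fields[2]) * 1e6  # dot clock is given in MHz
    hdisp, hsync_start, hsync_end, htotal = (int(f) for f in fields[3:7])
    vdisp, vsync_start, vsync_end, vtotal = (int(f) for f in fields[7:11])
    line_rate_hz = clock_hz / htotal   # one line is htotal pixel clocks long
    return {
        "name": fields[1],
        "clock_hz": clock_hz,
        "hdisp": hdisp,
        "htotal": htotal,
        "hsync_width": hsync_end - hsync_start,  # sync length in pixel clocks
        "vdisp": vdisp,
        "vtotal": vtotal,
        "line_rate_hz": line_rate_hz,
        "refresh_hz": line_rate_hz / vtotal,     # one frame is vtotal lines long
    }


if __name__ == "__main__":
    mode = parse_modeline(
        'Modeline "1024x768" 60.80 1024 1056 1128 1272 768 768 770 796'
    )
    print(mode["line_rate_hz"], mode["refresh_hz"])
```

For the example modeline this works out to a line rate of roughly 47.8 kHz and a refresh rate just above 60 Hz, both obtained purely by dividing the pixel clock by integer counts.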
So for a given video mode on a given computer, the proportions are stable. It's only the pixel clock rate itself which drifts slightly with temperature, aging, etc.
So basically, if you could know the modeline numbers, then you could divide your own pixel clock oscillator by the right number of clocks for the sync signal to match (1272 in the example above), and so have a PLL which locks your pixel clock to the source video card's. All you have to do then is count in the right number of pixels to the left edge of the active region.
How could you find the modeline numbers? Well, you personally could probably look them up.
What I suspect a modern pixelized (LCD etc.) monitor driven from an analog source does is run something of a "search" when you press the image-tune button. It's probably not hard to guess the horizontal resolution, and you can measure the duration of the active video. Then you can measure the ratio of the overall horizontal period to the active video, and from that you know the approximate overall number of pixel clocks in a modeline, e.g. in my example 1272, determined in time proportion to the 1024. So you lock your PLL at the guess of 1272 (or thereabouts).
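The proportion argument is simple arithmetic, sketched below in Python. The function name `estimate_htotal` and the `search` parameter are invented for illustration, and the two durations are assumed to come from whatever timing measurement your capture hardware can make (any unit works as long as both use the same one).

```python
def estimate_htotal(active_pixels, active_time, line_time, search=2):
    """Guess the total pixel clocks per line from two measured durations.

    active_pixels: assumed horizontal resolution (e.g. 1024)
    active_time:   measured duration of active video within one line
    line_time:     measured duration of the whole line (sync to sync)
    search:        how many neighbouring divider values to also try
    Returns (guess, candidates).
    """
    # Pixels and time are proportional, so scale the pixel count by the
    # ratio of the full line period to the active period.
    guess = round(active_pixels * line_time / active_time)
    candidates = list(range(guess - search, guess + search + 1))
    return guess, candidates


if __name__ == "__main__":
    pixel_period = 1 / 60.8e6  # only used here to fabricate example inputs
    guess, candidates = estimate_htotal(
        1024, 1024 * pixel_period, 1272 * pixel_period
    )
    print(guess, candidates)
```

In practice the measured durations will be a little off, which is why the sketch also returns a handful of neighbouring values to try as PLL divider settings.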
You then try various numbers around that and do some sort of comparison to see at which one your sampled data seems "best". "Best" may be defined as something like showing the highest average pixel-to-consecutive-pixel level difference near both sides of the screen. That indicates that the sampling is well aligned to the middle of the pixel periods rather than catching the intermediate transitions between them, and that this holds across the screen, so it matches in both phase and frequency. Or maybe it just checks that the active region is exactly the right number of pixel clocks wide. But if you can afford to generate a clock that's something like 3x-6x the actual pixel clock, then you have an easy mechanism for fine-tuning the sampling phase relative to the sync signal: divide the fast clock using a counter with a phase offset initialized at each sync. That's probably useful, as the sync driver won't have exactly the same clock-to-out delay as the RAMDAC analog outputs.
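One way the "best" comparison described above might be scored is sketched below in Python. This is a guess at a workable metric, not a description of what any real monitor does: the names `mean_abs_step`, `alignment_score` and `pick_best` are invented, and taking the worse of the left and right windows is my own choice for expressing "near both sides of the screen".

```python
def mean_abs_step(pixels):
    """Average absolute level difference between consecutive samples."""
    steps = [abs(b - a) for a, b in zip(pixels, pixels[1:])]
    return sum(steps) / len(steps)


def alignment_score(line, window=64):
    """Score one captured line; higher suggests better sampling alignment.

    Uses the worse of the left and right edge windows, so a setting that is
    sharp on one side but smeared on the other (a frequency error) loses to
    one that is sharp on both sides.
    """
    return min(mean_abs_step(line[:window]), mean_abs_step(line[-window:]))


def pick_best(captures, window=64):
    """captures maps a candidate setting to the line captured with it."""
    return max(captures, key=lambda k: alignment_score(captures[k], window))


if __name__ == "__main__":
    sharp = [0, 255] * 8             # samples land mid-pixel
    smeared = [0, 128, 255, 128] * 4  # samples catch the transitions
    drifting = [0, 255] * 4 + [0, 128, 255, 128] * 2  # good left, bad right
    captures = {1271: smeared, 1272: sharp, 1273: drifting}
    print(pick_best(captures, window=4))
```

A metric like this only works if the captured line actually contains sharp pixel-to-pixel changes near both edges, so it would probably need a suitable image on screen while tuning.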
Probably this analysis is something you'd do with software on a processor core (internal or external to the FPGA) that's allowed to poke registers which control your capture state machine, and run non-realtime statistics on a frozen line buffer. In effect, it gets to treat the capture hardware like a digital scope.