Why do we consider Borel sets instead of measurable sets?

Well, I'd say it depends on the context, but one reason that comes to mind is that the Borel $\sigma$-algebra $\mathscr{B}(\mathbf{R})$ is simpler (and smaller) than the Lebesgue $\sigma$-algebra $\mathscr{M}(\mathbf{R})$. For a lot of purposes the setting of Borel functions and the Borel $\sigma$-algebra is enough; using the Lebesgue $\sigma$-algebra would only make the proofs harder, or even invalidate the results you want to prove.

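An example of the "harder proofs" part: the $\sigma$-algebra $\mathscr{B}(\mathbf{R})$ is generated by the open sets of $\mathbf R$, and a lot of proofs use this fact. Unfortunately the situation is more complex with $\mathscr{M}(\mathbf R)$, which is the completion of $\mathscr{B}(\mathbf R)$ with respect to Lebesgue measure and so cannot be described by topology alone.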

An example of the "invalidating results" part: it's easy to show that if $f$ and $g$ are Borel then $f\circ g$ is also Borel. However, if you define a measurable function to be a function $f$ such that $f^{-1}(U)\in \mathscr M (\mathbf R)$ for every open set $U\subset \mathbf R$, then the composition of two measurable functions is not measurable in general.
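
To see where the two cases part ways, here is a short sketch of the argument, where $U\subset\mathbf R$ is open and $f$, $g$ are real functions on $\mathbf R$. The only identity needed is

$$(f\circ g)^{-1}(U) = g^{-1}\bigl(f^{-1}(U)\bigr).$$

If $f$ and $g$ are Borel, then $f^{-1}(U)\in\mathscr B(\mathbf R)$, and $g^{-1}(B)\in\mathscr B(\mathbf R)$ for every $B\in\mathscr B(\mathbf R)$, because $\{B : g^{-1}(B)\in\mathscr B(\mathbf R)\}$ is a $\sigma$-algebra containing the open sets. So the right-hand side is Borel. If $f$ and $g$ are only measurable in the sense above, all we know is that $f^{-1}(U)\in\mathscr M(\mathbf R)$, and nothing guarantees that $g^{-1}$ sends a Lebesgue set that is not Borel to a Lebesgue set, so the argument breaks at the last step.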

Side note: the fact that the composition of two measurable functions need not be measurable is closely related to the fact that some functions are Borel but not Lebesgue (where "$f$ is Lebesgue" means $f^{-1}(U)\in \mathscr M (\mathbf R)$ for every $U\in \mathscr M (\mathbf R)$). There is an exercise in Folland's Real Analysis about that, if I remember right. But $\mathscr M (\mathbf R)$ is absolutely crucial in integration theory: indeed, there are functions that are Riemann integrable but not Borel (think of the characteristic function of a non-Borel subset of the ternary Cantor set).
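
For the record, here is a sketch of one standard counterexample (not necessarily the one in Folland; the names $c$, $\psi$, $C$, $V$, $E$ are my own notation). Let $C\subset[0,1]$ be the ternary Cantor set, $c$ the Cantor function, and

$$\psi : [0,1]\to[0,2], \qquad \psi(x) = x + c(x).$$

Then $\psi$ is a homeomorphism, and since $c$ is constant on each interval of $[0,1]\setminus C$, the image $\psi([0,1]\setminus C)$ has measure $1$, so $\psi(C)$ has measure $2-1=1$. A set of positive measure contains a non-Lebesgue-measurable subset $V\subset\psi(C)$. Put $E=\psi^{-1}(V)\subset C$. Then $E$ is Lebesgue measurable (it is a subset of a null set) but not Borel, since otherwise $V=(\psi^{-1})^{-1}(E)$ would be Borel. Now

$$\mathbf 1_E\circ\psi^{-1} = \mathbf 1_{\psi(E)} = \mathbf 1_V,$$

so the composition of the measurable function $\mathbf 1_E$ with the continuous function $\psi^{-1}$ is not measurable. The same $\psi^{-1}$ is Borel but not Lebesgue in the sense above, and $\mathbf 1_E$ is Riemann integrable on $[0,1]$ (its discontinuities lie in $C$) but not Borel.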

To finish: yes, $\mathscr M (\mathbf R)\setminus\mathscr B (\mathbf R)$ is nonempty. But you have the following result:

If $A\in \mathscr M (\mathbf R)\setminus\mathscr B (\mathbf R)$, then there exist two Borel sets $M$ and $N$ such that $M\subset A$, $A\subset M \cup N$ and $\lambda(N)=0$ (so $A$ is a Borel set up to some non-Borel negligible set). Moreover, one has $\lambda(A)=\lambda(M)$.
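
The equality of measures is just monotonicity and subadditivity of $\lambda$ applied to $M\subset A\subset M\cup N$; spelled out:

$$\lambda(M)\le\lambda(A)\le\lambda(M\cup N)\le\lambda(M)+\lambda(N)=\lambda(M).$$

The negligible set in question is $A\setminus M\subset N$, which cannot be Borel, since otherwise $A = M\cup(A\setminus M)$ would be Borel too.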


I'm not sure what you have in mind when you say "example", but if you look in a basic (undergraduate-level) probability book you'll see that they really struggle with the fact that you can't assign a probability to an arbitrary event. The question then becomes which subset of $2^{\mathbb R}$ you want to consider. There's some desire to be as broad as possible, but the basic machinery that you need to develop is quite difficult, and perhaps too difficult for the typical student of classical probability, who will probably never encounter in the real world an event that is not a Borel set. I find in teaching probability that even Borel sets are too complicated for the typical student, who is a scientist who just wants to test the significance of their data. Such people will probably never encounter an event that's not an interval, or at most a union of two or three intervals, and they would be lost in the difficult details of analysis necessary to include more sets than they would ever actually need. That's why some authors, e.g. Larson, sidestep the issue of non-measurable sets entirely: they just warn the student that not all subsets of $\mathbb R$ can be events and then move on.