How to protect a global variable shared by an ISR and a regular function?

Using volatile is often quoted as a solution, but this is not quite true. It often only masks a problem, because volatile tends to make the code slower, which changes the timing rather than removing the race. If your only use is as shown, then volatile will probably work.

With a single reader and a single writer, it is probably better to use memory barriers. Your code would then be:

Mainline:

volatile int *p = &flag;
while (*p == false);   /* You must use volatile if you poll */
flag = false;
asm volatile ("" : : : "memory"); /* gcc barrier */

isr:

/* do something */
flag = true;
asm volatile ("" : : : "memory"); /* gcc barrier */

Here, the barrier just forces the compiler to emit the ARM str instruction at that point; the optimizer will not move memory accesses across the barrier in either direction. You can also use swp, or ldrex and strex, depending on your ARM CPU. Ring buffers are also often used between ISRs and mainlines, as they don't need any special CPU support, only the compiler memory barrier.
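As a rough illustration of the swp / ldrex / strex idea, here is a minimal sketch using C11 atomics, so the mainline tests and clears the flag in one atomic step. The names event_flag, isr_signal and mainline_take_event are made up for this example. With GCC on ARMv6 and later, atomic_exchange is typically lowered to an ldrex/strex loop, but that depends on the compiler and target, so check the generated assembly.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag shared between the ISR and the mainline.
 * Static storage, so it starts out false. */
static atomic_bool event_flag;

/* Called from the ISR: record that an event happened. */
void isr_signal(void)
{
    atomic_store(&event_flag, true);
}

/* Called from the mainline: returns true if an event was pending
 * and clears the flag in the same atomic operation, so there is no
 * gap between the load and the store. */
bool mainline_take_event(void)
{
    return atomic_exchange(&event_flag, false);
}
```

The mainline would poll with something like while (!mainline_take_event()) { }. Note that two interrupts arriving before the exchange still collapse into a single event; this only removes the gap between reading and clearing the flag.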

See the lock-free topic, and specifically search for lock-free and arm.

Edit: For additions,

Is there a way I won't miss interrupt at all ?

This is dependent on the interrupt source. If it is a timer, and you know the timer source can never be faster than XX instructions and no other interrupts are active in the system, then your current code will work. However, if the interrupt is from an external source like an Ethernet controller, a non-debounced keypad, etc., it is possible for multiple interrupts to come quickly. Sometimes new interrupts even happen during the interrupt handler.

Depending on the ISR source, there are different solutions. A ring buffer is commonly used to queue work items from the ISR for the mainline. For a UART, the ring might contain actual character data. It could be a list of pointers, etc. It is difficult to synchronize the ISR with the mainline when the communication becomes more complex, so I believe the answer depends on the interrupt source. This is why every OS has so many primitives and so much infrastructure for this issue.
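To make the ring buffer idea concrete, here is a minimal single-producer, single-consumer sketch for a single CPU, using only the gcc compiler barrier. All names (ring_put, ring_get, RING_SIZE, and so on) are invented for this example, and it assumes GCC or Clang, a power-of-two size, and that aligned loads and stores of unsigned are not torn on the target.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 16u  /* assumption of this sketch: a power of two */
#define COMPILER_BARRIER() __asm__ volatile ("" : : : "memory")

static uint8_t ring_data[RING_SIZE];
static volatile unsigned ring_head;  /* written only by the ISR */
static volatile unsigned ring_tail;  /* written only by the mainline */

/* ISR side: queue one byte. Returns false if the ring is full. */
bool ring_put(uint8_t byte)
{
    unsigned head = ring_head;

    if (head - ring_tail == RING_SIZE)
        return false;                       /* full, byte is dropped */
    ring_data[head & (RING_SIZE - 1u)] = byte;
    COMPILER_BARRIER();  /* store the data before publishing the index */
    ring_head = head + 1u;
    return true;
}

/* Mainline side: fetch one byte. Returns false if the ring is empty. */
bool ring_get(uint8_t *out)
{
    unsigned tail = ring_tail;

    if (tail == ring_head)
        return false;                       /* empty */
    COMPILER_BARRIER();  /* read the index before reading the data */
    *out = ring_data[tail & (RING_SIZE - 1u)];
    COMPILER_BARRIER();  /* finish with the slot before releasing it */
    ring_tail = tail + 1u;
    return true;
}
```

Each index has exactly one writer, which is why no locking or special CPU instruction is needed here. For a UART, the ISR would call ring_put() with each received character and the mainline would drain the ring with ring_get().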

How the memory barriers solve the issue, does it have effect when the code runs on single cpu ?

Memory barriers don't completely solve the missed interrupt issue, just as volatile doesn't. They just make the window much smaller. They force the compiler to schedule a load or store earlier. For example, take the mainline loop:

  1: ldr r0, [r1]  ; load flag
     cmp r0, #0    ; xxx
     beq 1b        ; xxx  keep polling while flag is false
     mov r0,#0     ; xxx
     str r0, [r1]  ; flag = false

If a 2nd interrupt happens during the xxx lines, then the flag has been set twice but the mainline only sees it once, so you have missed one interrupt. The barriers just make sure the compiler places the ldr and str close together.
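One way to avoid losing the second event, sketched here under assumptions rather than taken from the code in the question, is to count events instead of using a boolean. The ISR is the only writer of irq_count, and the mainline keeps its own private handled_count. The names are hypothetical, and the sketch assumes that an aligned 32-bit load or store is a single access on the target (as it is for an aligned word on ARM) and that ISRs incrementing the counter do not nest.

```c
#include <stdint.h>

static volatile uint32_t irq_count;  /* written only by the ISR */
static uint32_t handled_count;       /* private to the mainline */

/* Called from the ISR: count one more event. */
void isr_count_event(void)
{
    irq_count = irq_count + 1u;
}

/* Called from the mainline: returns how many events arrived since
 * the last call. Unsigned subtraction stays correct across
 * wrap-around of the counter. */
uint32_t mainline_pending_events(void)
{
    uint32_t seen = irq_count;            /* one load of the shared counter */
    uint32_t pending = seen - handled_count;

    handled_count = seen;
    return pending;
}
```

Because the mainline never writes irq_count, an interrupt arriving at any point is either included in this call's result or reported by the next one.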

What is the expected behavior when using barriers between different contexts?

The compiler memory barrier I show only constrains the compiler: it makes it emit the pending loads and stores at that point instead of delaying or reordering them. It has no effect between contexts. There are other kinds of barriers, but mostly they are for multi-CPU designs.

Can a sleep in the while loop solve the sync problems?

Not really; it is just a more efficient use of the CPU. The ARM WFI instruction can temporarily stop the CPU until an interrupt arrives, and this will save power. That is normally what sleep() does on the ARM. If missed interrupts are an issue, I think you need to change the communication between the ISR and the mainline. How to do that depends on the ISR source.