Computing the first $n$ values of the Liouville function in linear time
Your problem is related to the problem of finding all the primes up to $n$, and it can be solved by similar means.
Although the traditional sieve of Eratosthenes has complexity $O(n \log \log n)$, there are improved versions, which work in $O(n)$ time. It is achieved by crossing out each composite number exactly once. For the purpose of finding primes, these algorithms can even be optimized to $O(n /\log \log n)$ time by the so-called wheel optimization. You can find details in e.g. this paper.
In order to solve your problem, you have to calculate the least prime factor function (denoted by $lpf$) during sieving, not only the primes. When the $lpf$ function is ready, you can compute completely multiplicative functions in $O(n)$ easily by dynamic programming, e.g.: $$ \lambda(x) = \begin{cases} -1 & x = lpf(x) \\ -\lambda \left( \frac{x}{lpf(x)} \right) & \text{otherwise} \end{cases} $$
One of the algorithms which calculates least prime factors in $O(n)$ time is described here. Below you can see its implementation in C++, with primes
being a sorted list of all primes found so far, and lpf
being an array representing the same-named function.
vector<int> primes;
vector<int> lpf(n, -1);
for (int x = 2; x < n; x++) {
if (lpf[x] < 0) { //prime found
lpf[x] = x;
primes.push_back(x);
}
for (int i = 0; i < primes.size(); i++) {
int p1 = primes[i], y = p1 * x;
if (p1 > lpf[x] || y >= n)
break;
lpf[y] = p1;
}
}
Consider a number $x$ having least prime factor $p = lpf(x)$. For each prime $p_1 \le p$, it is easy to see that $lpf(p_1 \cdot x) = p_1$. The algorithm simply applies this crossing-out rule for each number $x$ in increasing order. Of course, it needs the sorted list of primes in order to iterate over all the necessary primes $p_1$.
Every composite number $y$ is crossed out exactly once when considering number $x = \frac{y}{lpf(y)}$, so time complexity is $O(n)$. In practice however, it is slower than the traditional $O(n \log \log n)$ sieve, at least for the purpose of finding primes. I guess your approach would also be faster, especially with bitsets.
If you are interested in practical acceleration of your algorithm, you'd better think about memory/cache optimization, instead of improving asymptotic complexity by a double-log factor =)