How to filter out rows with NaN values in Hive?
Hive relies on Java (plus SQL-specific semantics for NULL and friends), and Java honors the IEEE 754 standard for floating-point semantics. Which means that... NaN is tricky.
Quoting that post...
(Float.NaN == Float.NaN) always returns false. In fact, if you look at the JDK implementation of Float.isNaN(), a number is not-a-number if it is not equal to itself (which makes sense, because a number should be equal to itself). The same holds for Double.NaN.
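To make the rest of this answer concrete, here is a minimal sketch of test data with a NaN in it; the table name nan_test is made up, and it assumes a Hive version that accepts SELECT without a FROM clause. The cast goes through Java's Double.parseDouble, which is why the string 'NaN' turns into an actual NaN value (things like sqrt(-1.0) can produce NaN the same way).

```sql
-- Hypothetical test table: a DOUBLE column v1 with a regular value,
-- a NaN (via Double.parseDouble("NaN")) and a NULL.
CREATE TABLE nan_test AS
SELECT v1
FROM (
    SELECT cast('1.5' as double) AS v1
    UNION ALL
    SELECT cast('NaN' as double) AS v1
    UNION ALL
    SELECT cast(NULL as double)  AS v1
) src;
```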
So, there is no point in showing you how to use the (undocumented) Hive function called reflect2, which allows you to invoke raw Java methods on Hive columns, i.e.
where v1 is not null and not reflect2(v1, "isNaN")
...because -- in theory -- you can simply state:
where v1 is not null and v1=v1
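Wrapped in a full query against the hypothetical nan_test table above, that reads:

```sql
-- In theory: v1 = v1 is true for every ordinary value and false for NaN,
-- so the NaN row is filtered out (the NULL row is handled by IS NOT NULL).
SELECT v1
FROM   nan_test
WHERE  v1 IS NOT NULL
  AND  v1 = v1;
```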
Disclaimer -- I have seen cases where the Hive optimizer makes aggressive "optimizations" and produces wrong results.
In other words, if the simple v1=v1 clause does not filter out the NaN values as expected, then look into reflect2...
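For reference, here is what the reflect2 variant looks like as a full query, again against the hypothetical nan_test table; it simply calls the Java method isNaN() on each non-NULL value, as in the clause shown earlier:

```sql
-- reflect2(v1, "isNaN") invokes Double.isNaN() on the value of v1,
-- so NOT reflect2(...) keeps only the rows that are genuine numbers.
SELECT v1
FROM   nan_test
WHERE  v1 IS NOT NULL
  AND  NOT reflect2(v1, "isNaN");
```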
Edit -- indeed, the optimizer appears to ignore the v1=v1 clause in some versions of Hive (see comments), so a more devious formula is necessary:
v1 + 1.0 > v1
should work... except when rounding errors kick in because abs(v1) << 1 or abs(v1) >> 1 -- other "numeric" tricks will fail similarly in edge cases, especially when v1 = 0.0.
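As a sketch, the arithmetic workaround in a full query (same hypothetical nan_test table); keep the rounding caveats above in mind:

```sql
-- NaN + 1.0 is still NaN, and NaN > NaN is false, so the NaN row is dropped.
-- Caveat: for very large magnitudes of v1, v1 + 1.0 can round back to v1,
-- and this predicate would then wrongly drop a legitimate row.
SELECT v1
FROM   nan_test
WHERE  v1 IS NOT NULL
  AND  v1 + 1.0 > v1;
```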
In the end, the most robust approach appears to be cast(v1 as String) <> 'NaN' (because all possible NaN values are displayed as "NaN" even if they are not strictly "equal" in the arithmetical sense).
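As a full query, the string-cast approach is sketched below (hypothetical nan_test table again); it leans on the fact that Java prints every NaN bit pattern as the text "NaN":

```sql
-- cast(v1 as string) goes through Java's double-to-string conversion,
-- which renders any NaN as the literal text 'NaN'.
SELECT v1
FROM   nan_test
WHERE  v1 IS NOT NULL
  AND  cast(v1 as string) <> 'NaN';
```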
Side note about reflect2 -- you can see that it is indeed not mentioned in the official Hive doc, while reflect is mentioned (and even has a specific Wiki entry). But it has been implemented as early as Hive V0.11, cf. HIVE-4025.
Edit -- Java "reflection" is now disabled by default for ODBC / JDBC / Hue connections (see comments), and cannot be re-enabled when using security plug-ins such as Ranger or Sentry. So its usage is restricted to the (deprecated) hive CLI.
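A quick way to find out whether reflection is usable through your connection is to probe it on a literal; this is just a personal trick, not an official diagnostic, and the exact error you get when the UDF is blocked depends on the distribution and security setup:

```sql
-- If reflect2 is allowed, this returns false (1.0 is not NaN);
-- if it is blacklisted for the session, the statement fails outright.
SELECT reflect2(cast(1.0 as double), "isNaN");
```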