Java language detection with langdetect - how to load profiles?
I had the same problem. You can load the profiles from the LangDetect jar using JarURLConnection and JarEntry. Note that this example uses Java 7 try-with-resources.
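    import java.io.InputStream;
    import java.net.JarURLConnection;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Enumeration;
    import java.util.List;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;
    import org.apache.commons.io.IOUtils;
    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;

    // Runs inside a method that declares IOException and LangDetectException.
    String dirname = "profiles/";
    Enumeration<URL> en = Detector.class.getClassLoader().getResources(dirname);
    List<String> profiles = new ArrayList<>();
    if (en.hasMoreElements()) {
        URL url = en.nextElement();
        // Only valid when the profiles live inside a jar (jar: URL).
        JarURLConnection urlcon = (JarURLConnection) url.openConnection();
        try (JarFile jar = urlcon.getJarFile()) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                // Skip the "profiles/" directory entry itself; only files are profiles.
                if (entry.getName().startsWith(dirname) && !entry.isDirectory()) {
                    try (InputStream in = Detector.class.getClassLoader()
                            .getResourceAsStream(entry.getName())) {
                        // Read as UTF-8 rather than the platform default charset.
                        profiles.add(IOUtils.toString(in, "UTF-8"));
                    }
                }
            }
        }
    }
    // Load the JSON profile strings, then detect the language of `text`.
    DetectorFactory.loadProfile(profiles);
    Detector detector = DetectorFactory.create();
    detector.append(text);
    String langDetected = detector.detect();
    System.out.println(langDetected);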
Since no Maven support was available, and the mechanism for loading profiles was not ideal (you need to point it at files instead of classpath resources), I created a fork that addresses this:
https://github.com/galan/language-detector
I emailed the original author so he could merge and maintain the changes, but had no luck; the project seems to be abandoned.
Here is an example of how to use it now (you can write your own profiles where necessary):
    DetectorFactory.loadProfile(new DefaultProfile()); // SmProfile is also available
    Detector detector = DetectorFactory.create();
    detector.append(input);
    String result = detector.detect();
    // maybe work with detector.getProbabilities()
I don't like the static approach that DetectorFactory uses, but I won't rewrite the whole project. If you want that changed, you'll have to create your own fork or pull request :)
It looks like the library only accepts files. You can either change the code and try submitting the changes upstream, or write your resources to a temp file and have the library load that.
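A rough sketch of the temp-file route, using only the JDK. The class name ProfileExtractor, the method extractToTempDir, and the idea of passing the profile file names in explicitly are my own assumptions for illustration; nothing here comes from langdetect itself. It copies each named classpath resource into a fresh temp directory and returns that directory.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of langdetect.
final class ProfileExtractor {

    private ProfileExtractor() {
    }

    /**
     * Copies the classpath resources dirname + name (for each given name)
     * into a newly created temp directory and returns that directory.
     */
    static Path extractToTempDir(ClassLoader loader, String dirname, List<String> names)
            throws IOException {
        Path tempDir = Files.createTempDirectory("langdetect-profiles");
        for (String name : names) {
            try (InputStream in = loader.getResourceAsStream(dirname + name)) {
                if (in == null) {
                    throw new FileNotFoundException("Resource not found: " + dirname + name);
                }
                // The directory is fresh, so the target file cannot exist yet.
                Files.copy(in, tempDir.resolve(name));
            }
        }
        return tempDir;
    }
}
```

You would then hand the directory to the library, e.g. DetectorFactory.loadProfile(dir.toFile()), assuming your langdetect version has the File overload. Keep the directory limited to profile files, and delete it yourself when you are done, since nothing cleans it up automatically.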