Python, PyTables, Java - tying all together

Do not make this more complex than it needs to be.

Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.

Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.

This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".


This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.

The initial plan of using PyTables as an intermediary layer between your other elements and the datafiles seems solid. However, one design constraint that wasn't mentioned is one of the most critical of all data processing: Which of these data processing tasks can be done in batch processing style and which data processing tasks are more of a live stream.

This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.

Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.

Another contraint that's significant is your IT infrastructure. Do you have high speed network available storage? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch processing needs. You mentioned running your PyTables-using-application on the same server that's running the Java simulation. However, that means that server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when they query the data.)

To answer your questions directly:

  • PyTables looks like a nice match.
  • There are many ways for Python and Java to communicate, but consider a language agnostic communication method so these components can be changed later if necessarily. This is just as simple as finding libraries that support both Java and Python and trying them. The API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the standard library, Google's Protocol Buffers or Facebook's Thrift make good production choices. But don't underestimate how great and simple just "writing things to intermediary files" can be if data is predictable and batchable.

To help with the design process more and flesh out your needs:

It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:

  • Create two diagrams of your current architecture, physical and logical.
    • On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
      • Be certain to label the resources available to each server and the type and resources available to each connection.
      • Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
    • On the logical diagram, create boxes for every application that is running in your current architecture.
      • Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on it's own.)
      • Draw on disk resources (like the HDF5 and CSV files) as cylinders.
      • Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes and HDF5 file, they arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
      • Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.

Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).

At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)

To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:

  • What data formats or methods can this application use to communicate.
  • What data does it actually want. (Is this always the same or does it change on a whim depending on other requirements?)
  • How often does it need it.
  • Approximately how much resources does the application need.
  • What does the application do now that it doesn't do that well.
  • What can this application do now that would help, but it isn't doing.

If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.

You've already described one particular situation where you have quite a bit of java post-processing code that is doing "joins" on tables of data in CSV files, thats a "do now but doesn't do that well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and before that, the simulation, so no, there's nothing that can do that better in the current architecture.

So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.

Some final guidelines:

  • Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult the solution to implement and the more likely it is to fail. Use the least number operations, use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is the simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
  • Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great, the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that's going to work, for certain, very well, or it's not the best solution.
  • Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
  • Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki-them or file them away so you can whip them out again when the topic come s up.

And the answer to the direct question, "How to get Python and Java to play nice together?" is simply "use a language agnostic communication method." The truth of the matter is that Python and Java are both not important to your describe problem-set. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.