Revealing the Power of Legacy Machine Data

Download Slides

ENGEL, which was founded in 1945, now is the leading manufacturer for injection moulding machines on the global market. Since then, and especially in the current era, the amount of data has grown immensely and has also become more and more heterogenous due to newer generations of machine controls. Taking a closer look at the conglomerations of each and every machine’s log files, one can find 13 different types of timestamps, different archive types and more peculiarities of each control generation. Apparently, this has led to certain problems in automatically processing and analysing the data.

In this talk, you will explore how ENGEL managed to centralise this data in only one place, how ENGEL set up a data pipeline to ingest batch-oriented data in a streaming fashion and how ENGEL migrated their pipeline from an on-premise Hadoop setup to the cloud using Databricks.

Together with Oliver Lemp, Data Scientist for ENGEL, dive into the journey of integrating legacy data where you will learn how to manage the following aspects:

  • Ingesting classified and heterogenous data from field engineers
  • Unwrapping legacy data with native libraries in Spark
  • Moving a batch-oriented architecture to a streaming architecture
  • Partitioning and maintaining non-time series data for millions of variables

Speaker: Oliver Lemp


– Hey everyone, and welcome to Revealing the Power of Legacy Machine Data. My name is Oliver, and I’m happy to take you on a journey about dealing with Legacy Machine-Data at ENGEL. But before we start off, I would like to give you a small introduction about ENGEL first. So ENGEL, is Austria’s largest machine manufacturer and we focus on building machine or injection molding machines. We’re in fact, we’re also the market leader for injection molding machines. One might ask, what are injection molding machines? So basically, injection molding machines are the machines that are used to produce plastic products. ENGEL is set up in five different business units. And we start off with automotive, where for instance, you have in your car, car panels and so on, these kinds of things can be produced by our machines. Then we have the medical area, packaging, technical molding and also teletronics where our machines, for instance, produce frames for mobile phones or even TVs. To set the scene, we like to talk about the ENGEL customer service. So our customer service works in the following way. The customer calls our hotline, so, this is our first-level support when there’s a problem with our machine. The first level support then tries to solve the problem on the line. If he can, then it’s fine. Then he can close the ticket immediately, but if he cannot solve the problem immediately, he has to send the field engineer at some point. The field engineer is then advised to collect so-called error reports that we use to analyze errors on the machine. So, the field engineer and the first level support are in constant communication in trying to fix the problem. This mostly only happens if we have a more complex problem that’s related to more analytic stuff and so on that we really need to investigate. So when ENGEL headquarters team has finally solved the problem or identified the problem, his then able to guide the field engineer to solve the problem at the customer site. And this whole process is documented in our ticket system where we also make these so-called error reports. And back then all these analytics were done using Excel. So, Excel was our main tool of choice at ENGEL back then, everything was time losing Excel. We were not really using any classified data. It was just a whole mess of data that we worked through with Excel. So, and there, they were able to compare parameters. So if there were any wrong settings at the machine, then the service engineer can somehow derive if there’s an issue on the machine related to these parameters. But it might also be the case that the machine was just under such high pressure that it simply broke down. And this is where we had an idea that we could use to actually centralize all the data that we have with these error reports in one place and use one central analytics tool or a so-called DIY tool that we can then use to solve the problems in need. And this is where we thought, why not simply use a central self-service tool that everyone is able to use. So for instance, if the customer has a problem, then you can upload a report with this self service tool. If the first level support identifies a problem, he can also solve the problem on the line or even the field engineer if he wants to. So, this is our main use case where we want to assist our customer service. So we are built here on three main pillars. So the first one is that we want to use, data scientific methods to assist our customer support. So with these error reports and our ticket system, we have a classified error documentations that are attached with symptoms, errors, and solutions that we can use in future to automatically infer the errors that actually happen on the machine. Thus, we can reduce maintenance and repair times, and also try to predict future errors using live data for instance, and also discover serial defects in an early stage and inform our customers directly, which machines are affected. And last but not least probably one of the most important things is that we want to generate a sustainable knowledge base. With that, we can ensure a fast onboarding process of new employees since we have this self service tool that they can use to solve the problems and they can focus on fixing the problems efficiently. And on top of that, of course, we can use this data source to build data-driven solutions on top of it. And this is where we started off. So basically our data was distributed anywhere around ENGELs. So we had several mail servers, we had network shares with network drives, we had central servers, we had SAP, we had Hadoop. So basically everything was distributed. And this is where we started off and tried to somehow think of a solution where we can centralize the source of data to make it accessible for ever-grow. Because in this current setup, it was definitely not possible to search for a specific status report on a specific ticket case from our customer support. So this is where we needed to do something. And, before we go into details about how we centralize this, I want to give you more detailed insights of our status reports themselves. So basically, such an error report or also status report is a zipped collection of serialised log files, where log files may be the snapshot of the machines parameters, and so, the settings that are configured at the machine, but also for instance, the log file which contains the last 10 errors on the machine. And many more different log files. And we mainly use these error reports for fault discovery and also for documentation purposes. So as I’ve already mentioned before, our service engineers can make use of these error reports to read the errors from these reports, if they are able to, and if they can find the errors, then they can also provide a solution on top of it. And last but not least, we have been collecting these error reports since 1990, so this data format has definitely grown by a lot, and it started from a simple text file. And it’s now ranging from SIP files to TCC files and everything that you could imagine is being, stored in these status reports. And then these newer versions of these, error reports, we have a recursive archive structure, so we have zip files and sip files, and then again, we have sip files, so it’s a very heavily recursive archive structure that we have here. And then this archive structure, we can find log files, which most of them are binary serialised so we have some kind of custom civilization that was used to serialise the log files. And most of the time, non-standard tools where you were so nervous nothing like FLC realization or parquet or anything else that you can imagine like Creo. It was just plain custom binary serialization that you have to write a custom algorithm to be able to deserialize the data first. Then we have memory dumps and also the parameters snapshots. And then the image on the right hand side, you can see such a parameter snapshot, which contains a collection of all the variables that are configured on the machine, group by their corresponding components of the machine. So we have several root notes where for instance, this air pressure monitoring one, has several variables beneath it, and we have intermediate melts, and then we have to leave notes, which are the actual variables. And this is our parameter snapshot file. But I will get back to this file at a later stage. lets now talk about the issues that we had with these error reports, since it’s such a data source that has historically grown by a lot, we’ve now accumulated about 13 different timestamp formats that we have to deal with. And on top of that, as I’ve already mentioned, we have a different structure for each controller generation on the machines we have. So as I’ve said, we have text files, we have sip files, we have TGC files, everything that comes into your mind, and then, in the next stage you discover that you have 13 different timestamps format. So you can imagine how difficult it is to automatically process this kind of data especially when there are broken archives, because the error report was not successfully drawn, might be due to a software bug or anything else. And also we have missing files and many other many other different things that we had to handle first to be able to process this kind of data. So, which tool is able to kind of deal with this data. This is what we were thinking at this stage. And then we discovered that there is one software stack in the field that we could use, it is Hadoop. Most of you are probably familiar with it. And Hadoop back then was very much in fashion, there was hortonworks, hortonworks data flow and data platform that was very promising at first, and it had our core components like Apache NiFi, Spark, Kafka and HDFS that we were using as our core components. And we immediately started off to prototype it on a raspberry Pi cluster. So we had, five raspberry pies that we were using to set up a basic cluster and we tested all patents on there. And yes, it worked really well. And we were able to run it on this raspberry Pi cluster. So why shouldn’t we deploy them in an established on premise environment, on dedicated hardware in an on-premise environment. And this is what we did next. And here we have our initial prototype of our initial data flow solution. So the first part that we introduced that was also used in the first production scenario, and what basically happens is that the service engineer uploads his report via a third party tool through a rest endpoint to Apache NiFi, which we use as a data ingestion and routing too. So what NiFi does in this case, it simply writes the data to HDFS and also writes through Kafka. And the reason why we write to HDFS and Kafka is because Kafka on its own is not able to handle such huge files that we have with these error reports, because one error report can contain up to 30 megabytes. It is not really reasonable to store this kind of data into Kafka. That’s why we simply put the file path into Kafka and then read the data back from HDFS. And from Kafka onwards, we were able to stream using Spark streaming, where we had two Spark jobs where one was responsible for processing the metadata about these error reports, and one was responsible for processing these parameters snapshots as I’ve shown you before. We use HBase as a serving layer and also our HDFS using Spark applications to serve BI, web and mobile apps which you can see here. And of course, in this architecture as there is always problems, we also discovered some difficulties that we encountered here. So first of all, we had to maintain streaming and batch jobs. Since we have so much historical data in this use case, we somehow need to process or reprocess the data very much. And this leads to our kind of Lambda architecture, where we have to maintain both batch and streaming shops. Then of course we had the Kafka and the large files problem as I’ve already mentioned. This led to kind of a work around in our Spark streaming chops, as you can see on the right hand side. So we first read the JSON from the Kafka topic, then we read from the Kafka topic the file path to HDFS. And then from HDFS, we were able to read back the data into an actual RDD. And also again, here we had to use RDD’s because, Spark 2.x was not really able to support binary files without using the RDD API. So this was also one drawback that we found here. And then after the Kafka and the large files problem, of course, we discovered the Hadoop and the small files problem. Because Hadoop is not really able to handle small files, that’s why we had to kind of write the repartitioning job that takes the small files and merges them together to bigger files. And so this is kind of the opposite problem that Kafka has some Kafka likes small data and HDFS doesn’t like small data. So, this had to be treated. And, of course there was this problem with legacy binary deserialisation where we had to use custom libraries that are written on Pascal. And just, I really am talking about Pascal here that we have used with JNA to attach them to the Java virtual machine and use them in a spark chip. And of course, this was not the only difficulty to understand Pascal, so it was also very troublesome to make Pascal work with the parallelism of Spark. So, this was another layer of difficulty that was added here. And lastly, there’s this unpredictable parameters snapshots. So, the parameters snapshots, I also like to call the memory bombs because, some of them are two megabytes of size on disk and then they unfold up to 800 megabytes in memory or also on disc. And this is per report and it also varies very strongly. So, we have reports maybe from the older generation where we only have 200,000 variables, but the newer ones also contain up to 3 million variables per reports. So this was also very challenging and we also had to deal with this kind of data to store it. And, we didn’t really find a suitable database that was able to deal with this data. And this is where we went as classic data Like approach and where we simply start the data as parquet files in our HDFS. And for that, we didn’t really use the tree structure because it’s not really useful to navigate through this tree and for this use case, so we had to flatten the tree. We simply concatenated along the tree to the leaves by using dots and the variable name and the right image you can see how the final structure looks like. So, we have some metadata attached to this variables, and then we have the real values and also the units for these variables. So we needed to petition these system variables as we call them also. We thought at first, why not simply group them by the machines components and centrally publishing by that because that’s what usual machine manufacturers would think of. But in this case, we were not really able to do so because it led to a very heavy skew because some of the machines components only contained 10 variables while others contained about 100,000 variables. So, it’s really various from the machine components. And we had to search for a different solution in this regard. And this is where we invented our custom hash-based partitioning function, where we sent, it took the first three parts of the variable name, so we could easily split by adopting just take the first three variable part names and simply hash it using an md5 function. And then we were able to use the output of this function to assign petitions to the corresponding variables. So this is kind of a particular use case because we’re not really talking about time series data, and most of the time, when we talking about petitioning, you see the typical petitioning scheme by date and also month and so on. But in this case it’s not really time series data it’s just data that is drawn on demand, and we simply had to deal with it. And this is where we invented this hash-based partitioning function. It turns out to work really well. We had very balanced petitions but the only issue we faced is that regex curious and also creators that asked for multiple variables of the same machine component. Were not very efficient, because then we had to again such through multiple petitions because of the hashes randomness, and this is where we had to think of a different solution because their efficiency was really lost and we had to search through all petitions and these use cases. And after all this, architecture didn’t really work out well for us, so we discovered many issues that we had here are only some of them. So first of all, when using Hortonworks, you have these big data big bang upgrades and migrations, where you simply have to upgrade the whole software stack or nothing. Apart from that, it wouldn’t work. And this is where we had our troubles because nobody really had to know how all of these tools in this software stack actually work. So, this was not really reasonable for us and migrations took a very long time. Secondly, job monitoring was basically non-available one excuse for us might be that there are so many software components running on this Hadoop stack that we were not really able to distinguish how the resources were divided between these different software tools. And, it was really hard to set up the monitoring that’s why we had basically no monitoring, which is also very fine, if you think of it. Then, we had to write the repartitioning job. The repartitioning job is responsible because of the Hadoop small files problem, where we had to reconcile all the small files into bigger files, so that our Hadoop name notes doesn’t get too much pressure from all the files that we store in there. And as you might already guess when using such huge batch jobs, that process 180 K’s data supports where we have this high memory pressure because of these parameters snapshots, you can imagine of how many out of memory errors we got, as you can see on the right image, where we always should boost the young memory overhead. And we always posted it and it simply didn’t turn out to work because in the end it always fails. And of course, if the task was running 30 hours, it would fail at the 29th hour so this was really cool to debark hence ends to fix the problem, it was a really time consuming process. So after all, this whole architecture felt like a big workaround and we thought there must be a better way to actually solve this architecture. There must be someone who supports this use cuisine’s this is where we started to think of the cloud and the cloud as a possible solution. And this is where we are now. So we have migrated everything to the cloud, which would be a topic for a talk on its own, but let’s keep it at that. And again, here we have our field engineer who is uploading the report again to Apache NiFi now running on Kubernetes and NiFi simply stores the data now as a blob on the Azure data Lake storage, and here comes the cool part where we don’t need Kafka anymore. We can simply use Databricks and to auto load a feature, to stream from, our data source. And we don’t need to listen to Kafka and then read from HDFS again. So this was all removed by the very great feature of Auto Loader. And, then we only had one Spark chip left. We simply merged the remaining two Spark chips into one and used Cosmos DB as a serving layer and also Delta Lake with Databricks Spark to, serve our BI web and form the apps. But how are we dealing now with these system variables that I’ve mentioned before? So basically we wanted to be able to carry by the machines components, and this is where we had to think of a different petitioning scheme. So I call this the equi-distant range partitioning scheme. I’m not sure if there’s an official term for that, but what it basically do is we sought the variables and then we cut off after a specific variable and move the next variables into the next petition and so on. So for this to work, of course, we had to first initially determine the number of variables that belonged to each partition, and we had to create a lookup map so that we were able to design each variable to a petition and this is how it worked out. And in fact, it works out very well. So we are now able to issue rec ex careers and also do this component related careers, but also point look of careers are still as performant as before. And of course we are now able to use data breaks, optimize feature, and see ordering. So we didn’t really have to write every partitioning job anymore in the cloud. Everything was solved by using Databricks as our Spark environment. To work towards an optimal solution, so are we really speaking of an optimal solution right here? For the moment, I think we are speaking of one because we have reduced our complexity by a lot. We now only have one configurable spark shop that can do everything. It’s only one streaming shop, so we kind of unified the batch and streaming paradigm where we simply can use Auto loader to stream from the Azure data Lake storage. So basically we were able to use the Kappa architecture, if you want to name it. Then we were able to ditch Kafka completely. So we had one dependency less to maintain, which is also quite nice, if you think of it. Then we had reduced memory pressure because of this whole streaming architecture and with Spark micro batches, we were now able to reduce our memory pressure by a lot, which for this use case is very essential because we have these memory bombs and we simply had to deal with it. And by using micro batches, this problem just fades away easily. So this, in general led to more stable jobs in our environment. And lastly, we finally had monitoring for our Java virtual machines and the ganglia metrics that are lost so much from Databricks. So this is very handy and I can only recommend these ganglia metrics. They are really useful, to actually see the pressure on the virtual machines behind Spark environment. To get back to the use case at ENGEL. So the current state is that we have established several self service tools using these pipelines. As I’ve just mentioned. Here you can see one dashboard that we’ve created where our service engineers can see the overview of all state posts that we have. And we are also able to send a PDF summary to our field engineers with the relevant steps that they could take to actually solve the problem at the customer site immediately. Or he could also include other things that he could also fix on the machine. For example, clearing the memory, clearing the log files and so on. So basically a summary of steps to improve the machine’s health, states. And in future, we also want to use these error reports with our connection to the ticket system, to actually learn from the data with the fault history and so on. So that we are able to, when a user uploads a report, we’re able to automatically detect and classify the error and send back the possible or the relevant steps that our field engineer needs to do to solve the problem at the customer. And before I close this talk, I would like to leave you with a few points. So, first of all, to never underestimate the effort that is put into process legacy-data. And the unknown is just too daunting. So many unforeseeable things will happen. So we never knew in the beginning that there were 13 different times that formats, but legacy data just grows historically so much that you have so much history on this data. And so many different engineers that have worked with this data, and this just whole thing grew so much that we are now at this state, that we are able to deal with it and I can now work to improve it in future. Secondly, the change management to actually use the centralized approach and have this DIY era analytics tools that everyone can use, is very demanding. So it’s a time consuming process that you need to live through and the whole agriculture. And I think we have finally reached a point where we have achieved this, and everyone is kinda seeing the benefits out of this architecture that we have now, and that everyone is able to access the data. So, these self service tools, and they’re really understood by our company and they are really favoring it to the previous solution because they can now see all the re posts that they need. And then of course we have the unmanaged cluster thingie. So it was just pure frustration for us because we developers had to maintain the clusters on our own. And this is not a task that we should do. We should focus on developing applications on top of our data and not really spend our time with maintaining the current shops that are running on our infrastructure. So I definitely do not recommend that you as a developer, try to maintain such a class. It’s just pure frustration. And of course this whole complexity was then removed by moving it to the cloud where we have fully managed services. And also data breaks helped us a great deal in setting up this whole pipeline and everything seems to work out quite well now. All right, then that’s it from my side. I like to thank you very much for your attention. My name is Oliver and please do not forget to leave some feedback. Thank you, Bye.

Watch more Data + AI sessions here
Try Databricks for free
« back
About Oliver Lemp

ENGEL Austria GmbH

With a strong foundation in Arts and Bioinformatics, Oliver has gathered further knowledge in the fields of web development and data science at various (startup) companies and research institutions. In his previous internships/employments he was focusing on bioinformatics algorithms and NLP on social media data.

After that, he made a hard change in industries and has been working at ENGEL, Austria’s largest machine manufacturer, for 2 years now as the leading engineer for data science. Since then, he is facing new challenges every day (e.g., understanding injection moulding machines) and is trying to organize and make sense of machine-data – while attempting to adopt the value of data science and creativity in the traditional machine manufacturing sector.