U.S. Citizenship and Immigration Services (USCIS) is the government agency that oversees lawful immigration to the United States. USCIS seeks to secure America's promise as a nation of immigrants by providing accurate and useful information to our customers, granting immigration and citizenship benefits, promoting an awareness and understanding of citizenship, and ensuring the integrity of our immigration system. To keep up with the growing demand for timely and efficient access to immigration data, USCIS has had to continuously evaluate, streamline, and revise its data analytics processes.
Apache Spark with Databricks, a cloud-native data lake with Delta Lake, MLflow, and other tools became a crucial factor in the success of our agency's programs, such as the Electronic Immigration System (ELIS), eProcessing, operational and case-status reporting, fraud detection, Refugee, Asylum, and International Operations (RAIO), and forecasting, by liberating the data. Before that, USCIS was overwhelmed by legacy systems: operational data stores and a dated data warehouse containing disparate datasets, from mainframes to relational databases to unstructured data, all needing continuous updates. Fortunately, there was a way to remain current with the source systems on a more reliable platform that reduced risk while providing better function and containerized applications. We were outgrowing our traditional relational database's capacity to serve the added demand from the user community and stakeholders. Although a recent move to the cloud improved our capability, we required a dynamically scalable platform that could adapt and cater to the growing data demand. This presentation and technical demo take a deep dive into the path to accomplishing this requirement, and the lessons learned along the way, in an efficient and economical way of using Databricks and related technologies like Apache Spark, Delta Lake, and MLflow.
– Hi everybody. I'm Shawn Benjamin. I'm with U.S. Citizenship and Immigration Services. I am the Chief of Data and Business Intelligence. I've been working for USCIS since 2006; we created the first enterprise data warehouse within USCIS, and I am here to talk about our journey to modernization using Databricks and cloud services. So we came into the cloud in roughly January of 2016, and we knew we were bringing some problems in with us in the fact that we're moving data. So we had a few problems that we were facing: we were using ETL, Informatica ETL, and the pipeline was fairly brittle, with a lengthy development cycle, and the workflows were longer. And we were really lacking that real-time and near real-time ability, a time I like to call relevant-time data, and we were truly lacking a data science platform. So we knew where we were going into the cloud and we knew where we wanted to be. So the real question we had to ask ourselves was: how do we get there?
So we were bringing this legacy architecture in with us. We're bringing source system data in with Informatica, putting it into our data warehouse, which is Oracle-based. And we're still running Oracle to this day. And then we're presenting that data through Oracle OBIEE, which is a product that we've named Smart within USCIS, and SAS. We did have some data scientists, and they were extracting data out of those tools, bringing it to their laptops, running their R and Python code and trying to do data science activities with low computing ability, and definitely not within the enterprise vision. We already had a pretty good wealth of business analysts and statisticians using SAS and Oracle.
But we knew coming into the cloud that we were bringing this environment, which we're very proud of. As you can see here, we had about 36 data sources with 28 ETL processes and 2,300 users. And this was an evolution that took roughly eight years to achieve.
But once we moved into AWS we knew we needed to expand, and our user base was expanding as well. And the speed of their growth was definitely driving where we wanted to go. So we started doing some research. When we first went into the cloud, this area called eCISCOR that you see on the bottom left side of your screen was all within Oracle.
The Smart subject areas are above that, being OBIEE, and you see the SAS libraries to the right. We moved data into S3, and we were not that successful in getting it out to the users timely. And we were just starting to learn it ourselves, and what it meant to get data from a data lake, which is something that we were very interested in moving towards.
We needed a tool to get there. We had a Solutions Architect at that time, and we started digging into this research, and we discovered Databricks, and the light bulb really clicked for us on what we needed to do moving forward to stay relevant. So we built a successful proof of concept bringing in Databricks. We brought it into our VPC, in a 26-node setup, where we were trying to attain four objectives of how we could support what our current mission was and still move forward. So obviously we needed to connect to those Oracle databases. So we connected the Databricks clusters to those databases, created the relevant Hive tables that we needed, and started copying the data in. And for context, we were seeing big improvements right there, where we were bringing 120-million-row tables into S3 from Oracle, using Scala code within Databricks, within just 10 minutes. And I think our Informatica timeline on that was somewhere around two to three hours. We needed to identify the appropriate participants and, obviously, create multiple notebooks. We wanted to get it out to our user base and have them start to take advantage of the ability within the product. Over a lengthy amount of time, we basically came to what our current implementation is right now. And as you can see, Databricks and Spark sit alongside Informatica as our ETL tools, driving data from various sources, whether it's Kafka topics, Oracle databases, or PostgreSQL databases, into the data lake, and we're able to source both directly through Databricks and from the Lake within our user interface tools such as Tableau, which we've recently procured, OBIEE, and SAS, plus having a wealth of users within the Databricks notebooks.
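As a rough, hedged illustration of that parallel pull from Oracle: Spark's JDBC reader splits a numeric partition column into contiguous ranges, one query per partition. This stdlib-only Python sketch reproduces that range arithmetic (the row count matches the 120-million-row example above, and the 26 partitions echo our node count, but both are illustrative here; the real ingestion was Scala code in Databricks):

```python
def jdbc_partition_ranges(lower, upper, num_partitions):
    """Split [lower, upper) into contiguous ranges, one per JDBC partition,
    mirroring how Spark parallelizes a read on a numeric partition column."""
    stride = (upper - lower) // num_partitions
    ranges = []
    start = lower
    for i in range(num_partitions):
        # the last partition absorbs any remainder
        end = upper if i == num_partitions - 1 else start + stride
        ranges.append((start, end))
        start = end
    return ranges

# e.g. 120 million rows keyed by a numeric ID, read with 26 partitions
parts = jdbc_partition_ranges(0, 120_000_000, 26)
```

Each range becomes one `WHERE id >= start AND id < end` query, which is what lets a large table land in S3 in minutes instead of hours.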
Delivering capability to multiple groups from a single point, and not having so much of that swivel-chair activity going on. One of the things that we did in our implementation as well was decide that we needed to bring our entire data warehouse into the data lake. That way we could take advantage of all of our interfacing tools with all of the data that we had in USCIS.
So what you're seeing here now is what we look like after roughly four years of being in Amazon. We have ballooned from 30-something data sources to 75 data sources, with 35 application interfaces and multiple marts, tripling the size of our user base because of the capability that has been delivered. So everything you see now at the bottom of this slide is the enterprise data lake for USCIS.
And then we still retained our Oracle platform. When we first started with Databricks, we were trying to bring the data from the sources and Oracle into the Lake. We've realized that, to save costs, and because the speed of ingestion is so much better, it's best to go directly to the Lake. So we're starting to reverse, or re-engineer, that flow of data: that ETL moves data from the source to the Lake, and then to Oracle if we need to. We can access the data lake with all of our UIs.
And many of those are taking advantage of going through Databricks to access the Lake. So Databricks sits over that S3 storage area. And what we're also doing now is building our data marts and our warehouses directly into the Lake.
One of the things that I really appreciate, and you can see it in the speed at which we've been able to grow, is that Databricks has given us the capability to analyze data or move data quickly and understand more about that data in a quicker fashion. And one of the success stories that I talk about with our customers, when I'm getting them to change over to our product, is the ability to fail quickly and recover. What I mean by that: we brought in a few teams from around the country to build a product we call the National Practice Dataset. In the older tools, as they were using SAS to start, it was taking them a day to find out that their logic was not right. So they would build this logic, run all the data through the logic, and find out the logic wasn't right.
Within Databricks, and this is a real story, we ran a very complex query through Databricks, went to lunch, came back 19 minutes later, and it was done. And the team realized that their logic was incorrect. So what would have been an eight-hour loss was a 19-minute loss. And you could really see the data analysts and the business analysts in the room saying, "wow, this product just saved us so much time." They could re-engineer what they were trying to build, try it again, and see successes. We've had similar stories of taking some larger SAS logic that was taking six hours within SAS to run, to build what you could call just-in-time warehousing, moving it over to Databricks, and having it take, I think, 10 minutes, while actually being more accurate. We've been able to take those stories and bring them out to the user community. And now a lot of teams are starting to build within their own data ponds inside the Lake and extending that out to the enterprise.
So, through Databricks, among the accomplishments that we've seen in this relatively short timeline is the implementation of Delta Lake, which we'll probably talk about a little bit later.
The ability to insert, update, and delete, which is a huge operation for any IT organization, and to be able to bring that into the lake, and have that ability with Delta Lake and what it brings to the teams. The ease of integration with our tools was key: having all those native connectors with OBIEE, SAS, and Tableau has made things really easy for everybody. Integration with GitHub, for our continuous integration and deployment. We're continuing to work on automating our account provisioning with Databricks, so that we're spending less time creating accounts; we've had a lot of requests as of late. The tool has really taken off for us, especially since we've gotten into this last bubble here: machine learning and our MLflow integration. Within USCIS there are multiple teams outside of our direct team that have gone down the road of building out large ML projects and experiments.
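For readers curious what that insert/update/delete capability looks like in practice, Delta Lake exposes it through `MERGE INTO`. This is a minimal sketch that only templates the SQL text; the table, key, and sequence-column names are hypothetical, not our actual schema:

```python
def build_cdc_merge(target, source, key, seq):
    """Template a Delta Lake MERGE that upserts the latest CDC change per key.
    Table and column names are caller-supplied; this only builds the SQL text."""
    return (
        f"MERGE INTO {target} t "
        f"USING {source} s ON t.{key} = s.{key} "
        f"WHEN MATCHED AND s.op = 'D' THEN DELETE "
        f"WHEN MATCHED AND s.{seq} > t.{seq} THEN UPDATE SET * "
        f"WHEN NOT MATCHED AND s.op != 'D' THEN INSERT *"
    )

# hypothetical case-tracking tables
sql = build_cdc_merge("lake.cases", "staging.case_changes", "case_id", "change_seq")
```

In a notebook you would hand that string to `spark.sql(...)`; the point is that deletes and out-of-order updates are handled declaratively instead of by hand-rolled rank logic.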
So I'm gonna hand the presentation over now to Prabha Rajendran, who is the lead engineer on our team in implementing and installing capabilities for Databricks. – Thank you Shawn, for that introduction. I'm Prabha Rajendran. I've been with the USCIS program since 2013, and I got the opportunity to start and do the POC and lay the groundwork for Databricks to get us going to where we are today.
I was also part of this journey, which is my privilege. Getting into the technical aspects of what we have done, here are some of the accomplishments, as Shawn said in the previous slides, that we have achieved so far. First comes Change Data Capture. When we started this journey, we wanted to replicate the data from the Oracle targets into the Lake first. So how can we do this? We made sure that there was a pipeline, or a flow, from the Oracle target where we could unload all the data into the lake through Databricks using Scala code. And that's where we started, with simple code; you could call it just pulling and pushing the data into the lake, so that at least the OBIEE application users would start using the data lake going forward. That's how we started. But how can we keep up with the change data capture that's happening? If we have to have near real-time replication, how can we even accomplish this? Especially two or three years ago, when Delta was not there, or was not so popular at that time. We started with inserting all the records into the lake, and then we had to do a rank function to do the de-duplication, as well as to show the unique value for that particular record, so that the reporting would not be affected and the transition from Oracle to the Lake would be seamless. So that's how we started. That was a tedious way of coding for us, but Delta Lake, with all the enhancements that have been made so far for inserts, updates, and deletes to be captured, made our life very easy. So we can use any of the streams, like Kafka or Kinesis, or even AWS DMS, anything, to land the data into S3. And then from there we wrote a framework in Delta, so all the data is stored in S3 as a Delta Lake. So that's how we transitioned.
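The pre-Delta rank-and-de-duplicate approach described above boils down to: keep the highest-sequence change per key, and drop keys whose latest operation is a delete. A stdlib-only Python sketch of that logic (field names are hypothetical; in the real pipeline this was a window rank over the appended records):

```python
def latest_per_key(records):
    """records: iterable of dicts with 'key', 'seq' (change order), and 'op'
    ('I'nsert / 'U'pdate / 'D'elete). Returns the current view: the
    highest-seq record per key, dropping keys whose latest op is a delete,
    which is the effect of filtering on rank() = 1."""
    latest = {}
    for rec in records:
        cur = latest.get(rec["key"])
        if cur is None or rec["seq"] > cur["seq"]:
            latest[rec["key"]] = rec
    return [r for r in latest.values() if r["op"] != "D"]

# hypothetical CDC feed: case 1 is inserted then updated, case 2 is deleted
changes = [
    {"key": 1, "seq": 1, "op": "I", "status": "received"},
    {"key": 1, "seq": 2, "op": "U", "status": "approved"},
    {"key": 2, "seq": 1, "op": "I", "status": "received"},
    {"key": 2, "seq": 2, "op": "D", "status": None},
]
view = latest_per_key(changes)
```

Delta's MERGE made this bookkeeping unnecessary, which is why the framework moved to it.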
So what we gained out of this: one is that the users who are reporting off the central Lake do not have the latency, because the inserts and the writes were happening at the same time, and there was no latency for the users to retrieve these records. That was one advantage. There was also fast ingestion of the CDC changes. Then, for the reports which were going against the Delta Lake, the performance was improved. And if there were any schema changes back in the source system, they were very seamlessly replicated. Also, failures were gracefully tackled. So these were some of the advantages which Delta Lake as such had in its product; we utilized that in our program and took it forward. But what did we learn through this process when we implemented Delta Lake? One of the things which we did not pay attention to was vacuuming. Initially, when we started this experimentation with Delta Lake, the vacuuming was something which we only accommodated after a month, and we had our own hiccups when the logs got too huge and it really broke down our system. So vacuuming is one mandatory thing: we now have a seven-day retention period, and sometimes even a two-day retention period if the volume of the CDC increments is high. So we made sure that vacuuming is part of any of this Delta code framework. That is one thing we learned. One more thing we noticed was that the storage requirement obviously increased on our end, because we keep the logs aside. So that's how we modernized Change Data Capture in our program. Now let's talk about, once the data has landed in the Lake, how our applications connected to this analytical platform, that is, how the applications connected to the Lake. That's where Tableau comes in.
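Before moving on to the tools: the retention policy just described, seven days normally and two days for high-volume CDC tables, can be sketched as a small helper that picks the retention and templates the Delta VACUUM command (the volume threshold and table name are illustrative assumptions, not our actual values):

```python
def vacuum_statement(table, daily_cdc_rows, high_volume_threshold=10_000_000):
    """Choose a 2-day retention for high-volume CDC tables, 7 days otherwise,
    and template the Delta Lake VACUUM command (retention is given in hours)."""
    days = 2 if daily_cdc_rows > high_volume_threshold else 7
    return f"VACUUM {table} RETAIN {days * 24} HOURS"

# a hypothetical heavy CDC table gets the short retention
stmt = vacuum_statement("lake.case_history", daily_cdc_rows=25_000_000)
```

Running something like this as part of every Delta job is the "vacuuming is part of the framework" lesson above.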
These are some of the dashboards which we wanted to show from what we have implemented so far. Initially, when we started with Databricks and there was no built-in connector available, we utilized the ODBC Simba drivers to fetch data from the Lake into Tableau. But now, because Databricks has its own native connector, this has become even easier for us, and the performance of extracting data into the Tableau servers has become faster. Speaking of faster performance, one more milestone which we achieved was the Hyper API. The Hyper API helps us extract data faster in the notebook in Databricks and then publish it to the server, so the reporting end does not take that strain. That was one of our bigger milestones, and it helped us refresh data at a faster rate on the reporting end. This is another example, of OBIEE and SAS, which we pointed to the Lake with the data that was loaded through Databricks Scala code; OBIEE and SAS were pointed to Oracle targets before.
What we noticed was that some of the queries behind the dashboards created in OBIEE were taking, say, around 15 minutes of run-time, or sometimes those dashboards would not return results at all because of the load on the Oracle database. Sometimes the queries used to churn for a long time and nothing came back. Whereas, when we ran the same queries pointing to Databricks, the query returned in under 15 seconds, 15 to 20 seconds maybe. So the performance improvement was drastic, and the queries always came back; that was consistent. So that is one more advantage we had in pointing our data to the Lake, especially for existing dashboards: moving them from pointing at a database to pointing at a Lake where the data underneath is being written through Delta Lake.
Going to the next topic: what did we do in the data science field? This could itself be a bigger discussion, but let me briefly cover what we have done through this journey. How did we start? We took a particular use case for USCIS. We wanted to predict: what is the probability of people not showing up to an appointment, and why do they not show up, for a biometrics appointment or any appointment which they have with USCIS? Why did they not show? How can this be avoided? What are the prediction factors causing a no-show for an appointment? That is the simple use case which we took. How did we move ahead with that? We used R and Python in the backend, and we used some of the built-in libraries to munge the data, and we divided the data with the 70/30% rule. We ran the data through prediction models to see whether the location of a person, or the lead time with which the appointment was scheduled, were factors causing a person to not show up to an appointment. So these are some of the prediction models which we built using the Databricks platform. And we did not stop there. We also took that model and ran it through MLflow, with the ML runtime clusters which we have in Databricks, end to end, to see how this could be predicted and what pattern we were seeing. As a next step we took it another level, where we wanted to do some sentiment analysis on survey data. So one of the examples which we are showing here is for text and log mining. In the text mining, we just wanted to see what sentiment a particular set of data carries. That prediction was done using R and Databricks, and we were able to do it in a very short period of time, a matter of a day or two, to give the sentiment analysis results. It was so quick because all the built-in libraries in Databricks help us get these results. We also did some log mining, to see how the performance of any of the systems was doing, how many users were logging in, and what the frequency of a user logging in was. All these kinds of things from the input data logs that we store in S3; we wanted to do some prediction on that, so we did some log mining as well as part of our experiments. We also did time series, to predict how many applications would be coming in future years, what the pattern was, the dips and the highs, and how many applicants were applying for certain kinds of applications. These were also run through the time-series models. And what we did was run through all the algorithms, like the Random Forest regression models or logistic regression, and we evaluated which would be the best, which had the better accuracy, and based on that we reran the same models which we had done with the previous libraries. We managed to do that this year. And there was one more thing: the H2O integration. H2O is another AutoML tool available to us, which we were able to integrate with Databricks seamlessly. It was just an additional cluster next to the existing clusters, and an additional library on a cluster. And that helped us utilize H2O as part of these experiments. So let's get to the Security and Governance part of the slide.
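The 70/30 division mentioned for the no-show model is the usual train/test split. A stdlib-only sketch of that step (the actual modeling used R and Python libraries on Databricks; the appointment features here are hypothetical):

```python
import random

def train_test_split(rows, train_frac=0.7, seed=42):
    """Shuffle and partition rows 70/30 for model training vs. evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# hypothetical appointment records: (lead_time_days, distance_miles, no_show)
appointments = [(ld, d, ns) for ld in range(10) for d in range(10) for ns in (0, 1)]
train, test = train_test_split(appointments)
```

The training portion feeds the candidate models (random forest, logistic regression, and so on), and the held-out 30% is what the accuracy comparison above was measured against.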
We were shifting our mindset from the legacy systems to the data lake concept. We wanted to maintain the same privileges and ACLs for all the datasets which we have in the Lake, so that the same roles and privileges are maintained. With Databricks and its built-in ACLs, we were able to do that. We also maintained credentials with secrets management, where we had a secret manager to store all the credentials and reference them in the notebooks. Those were some of the available features in Databricks which we utilized as part of our system. And we also created the exact IAM roles which are required at the S3 bucket level, as well as the ACL rules in Databricks, to have this security and governance for any of the datasets we have. On the Databricks management API usage: this being a built-in feature of Databricks, we were able to create any number of clusters or jobs and monitor them on a daily basis. And we also integrated with Airflow to manage and orchestrate this monitoring experience. Likewise with libraries: anything can be reproduced in your notebooks using the existing available list of libraries. That made it easier for our machine-learning and data science teams to do programs which would otherwise have taken much more time to code through, instead of just attaching a library to a cluster. Also, continuous integration and deployment: we integrated Git, doing the configuration of Git with Databricks, and that was an easy way for us to sync all our changes; it also helped us from a continuous integration and deployment perspective. Apart from that, the last one is MLflow. MLflow is a big milestone for us; we are able to run these models using the MLflow-enabled clusters, and the API has been helpful so far.
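For the management API usage above: Databricks' REST API authenticates each call with a bearer token, which pairs naturally with the secrets management just described. A stdlib-only sketch of building such a request (host, token, and endpoint are placeholders; in our setup the orchestration actually ran through Airflow):

```python
import json
import urllib.request

def databricks_request(host, token, endpoint, payload=None):
    """Build an authenticated request to a Databricks REST endpoint.
    Databricks personal access tokens go in the Authorization header."""
    url = f"https://{host}/api/2.0/{endpoint}"
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data)
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", "application/json")
    return req

# placeholder workspace host and token; clusters/list is a real 2.0 endpoint
req = databricks_request("example.cloud.databricks.com", "dapi-XXXX", "clusters/list")
```

In practice the token itself would come out of the secret manager rather than being hard-coded, exactly as described above.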
That's all I had for today; back to you, Shawn. – Thank you, Prabha, for filling us in on the work that you've done over the years. It's been a tremendous amount of work and we've had a tremendous number of accomplishments. So, in the last couple of slides I have for y'all are the lessons that we learned. And it's just kind of a combination of things that we found along the way. One of those things is absolutely building a training plan. We at USCIS still didn't fully grasp what data science was when we brought this product in. So we were trying to learn it ourselves. We actually went through the training ourselves to use the product and the tool. And then we brought in, I think, maybe 30 people from around the country in different offices to actually go through the training, and some of them were advanced and able to grasp the concepts just fine, some not as much. So at that moment we decided that we needed to work harder with our user base and build our own internal development plan for our users. Like with any place of business, training within your own data sets is always very big. It helps the user base realize the power of the product. We're actually in the process right now of building a customized training plan, alongside Databricks training and support, that will suit the needs of our mid-level to advanced users, especially as a lot of shops have now started to grasp onto Databricks and want to move their code out of SAS into Databricks. Building more customized training has allowed us to build a stronger connection with our users. The cloud-based experience through this tool has become fully realized for us. We thought moving into the cloud, like anybody who has been sold anything, was going to be immediately fast and easy. It wasn't until we brought in Databricks that we could really grasp that. And I think I showed you that with the ability to expand: by becoming so scalable, not only did we expand the amount of products we're delivering to folks, we expanded the knowledge base within our own team. Prabha was on the first wave of folks to use the product, and I think that caused a lot of drive from other members within our development team to want to learn the tool and the product as well. And through that we've brought a bunch of Informatica ETL developers and OBIEE developers into learning how to use this tool. Prabha was on the first wave of learning how to use the product and understanding it. She trained the second wave, and now the second wave is training the rest of our team. We have a lot of folks within our development shop that are experienced with the tool and know how to use the product to deliver. Subject matter expertise is also very important, and not just subject matter expertise within Databricks or within cloud computing, but in the data. And it goes back to the training plan. The more you understand the data and what the product is allowing you to do with the data, the more powerful the stuff you can unlock. So as we started bringing our subject matter experts of the data into the product, training them to be at minimum proficient with the product, they saw the value and started getting better, and in turn have started to become subject matter experts within the tool itself. Not necessarily coming to IT for support, but going to each other, what I like to call Tier Zero: ask your friend, they may know the answer and may be able to help you along. And when you start getting that type of organic grassroots support within these tools, it just makes everything easier for everybody, and it aligns your business to your IT. And the automation we've been able to accomplish there is unbelievable.
And when we're talking about automation, it's not just automation of jobs but automation of data and how it can work for you. To save costs within USCIS, and within Amazon, we do turn off a lot of our servers overnight and in off hours. We've been able to put jobs in place where developers can go in (everybody is remote, and especially right now we're all very remote, working on different time cycles), turn the environments on, work through data, and turn the environments back off. Being able to allow folks to do that, we've seen some major cost savings. And eventually we really wanna get to the point where we're driving a lot of the automation within the tool using machine learning, operationalizing our machine learning models to do the tedious work, where we can be saving the time of the human element and refocusing that workload somewhere else.
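The overnight shutdown routine above can be sketched as the schedule check an automation job might run before deciding whether an environment should be up (the business-hours window and weekday rule are illustrative assumptions):

```python
from datetime import datetime

def should_be_running(now, start_hour=7, stop_hour=19, weekdays_only=True):
    """Return True if a non-production environment should be up:
    inside business hours, and (optionally) only on weekdays."""
    if weekdays_only and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return start_hour <= now.hour < stop_hour

on = should_be_running(datetime(2020, 6, 24, 10, 30))   # a Wednesday morning
off = should_be_running(datetime(2020, 6, 24, 23, 0))   # overnight
```

A scheduled job comparing this flag against each environment's actual state, then starting or stopping clusters accordingly, is the shape of the cost-saving automation described; developers working odd hours simply flip an environment on, and the job turns it back off.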
So, as we implemented Databricks, we built a success strategy, and we hit every point of it, as you can see here. The first thing right away: performance. Obviously, with the ability to scale and leverage on-demand and spot instances, being an Oracle guy in the past where we didn't have that ability, we immediately saw benefits right there. And the scalable read-write performance that we had on S3 was amazing. Support for a variety of statistical programming has been huge for USCIS; the bulk of USCIS users are SQL developers and SQL programmers. So we were able to slide them into the tool leading with that language, and we did have a lot of R and Python users as well. But they were able to learn multiple languages working in a single notebook and move between the languages. As the agency heard us talk a lot about Scala development, this being our main code preference for ETL, we drove in some benefits of showing the power of that language to our users, and now those users are actually looking at getting more familiar with Scala and taking training on that.
The ability to work in so many languages actually opened up the diversification of our analytics community as they moved into Databricks. And obviously there's the machine learning and deep learning within those tools: how we've been able to take the concept of moving data through a machine process into outcomes, where we can take a tedious job function that would have been done manually, at a clerk level or a singular-person level, move it through the process, and derive direct value from that. And you can move those resources from a more tedious task into other areas of function.
The integration with our existing tools, allowing connections across all the tools that we have, has been very, very big for us. Having so many UIs now, with OBIEE, Tableau, and SAS, we are able to integrate the products with each other. What we try to focus on, and get our user community to do, is just-in-time builds and just-in-time warehousing on the Databricks side: build the data, do your heavy lift there. If you're more comfortable with using SAS, or more comfortable using Tableau, do your heavy lift within Databricks, build your datasets, and then use SAS or Tableau to analyze and work that data. So this wraps up our presentation.
Thanks everybody very much for allowing Prabha and me to share our journey with you.
Shawn Benjamin has nearly 20 years of experience in federal-sector information technology. Joining U.S. Citizenship and Immigration Services in 2006, he is a founding member of USCIS's enterprise data warehouse program and has continued to apply innovation in analytics and data strategy. Shawn now serves as the Chief of Data & Business Intelligence for the USCIS Office of Information Technology.
Prabha Rajendran has 18 years of experience in data integration architecture using big data solutions, cloud computing, business intelligence, and data science. Joining U.S. Citizenship and Immigration Services in 2014, she has been instrumental in leading new technical innovations in modernizing the USCIS analytical platform. Prabha now serves as the Technical Lead of Data & Business Intelligence for the USCIS Office of Information Technology.