What is a transaction? In the context of databases and data storage systems, a transaction is any operation that is treated as a single unit of work, which either completes fully or does not complete at all, and leaves the storage system in a consistent state. The classic example o {. . .}
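The all-or-nothing behavior described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a reference implementation: the `accounts` table, the `transfer` helper, and the account names are hypothetical and chosen only for the example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single unit of work."""
    try:
        # `with conn` commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # applied to `dst` above has been rolled back as well.
        return False

# Hypothetical schema: balances may never go below zero.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # completes fully
failed = transfer(conn, "alice", "bob", 500)  # does not complete at all
final = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After both calls, the second transfer leaves no partial change behind: the balances reflect only the first, successful transfer.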
Gradient descent is the most commonly used optimization method deployed in machine learning and deep learning algorithms. It’s used to train a machine learning model. Types of Gradient Descent {. . .}
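As a rough sketch of the idea, the loop below applies plain gradient descent to a one-variable function. The function `f(x) = (x - 3)^2`, the learning rate, and the step count are assumptions picked for illustration; training a real model applies the same update to many parameters at once.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Gradient of the illustrative function f(x) = (x - 3)^2, minimized at x = 3.
def grad_f(x):
    return 2 * (x - 3)

minimum = gradient_descent(grad_f, x0=0.0)
```

With these settings `minimum` ends up very close to 3; a learning rate that is too large would make the iterates diverge instead.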
What is Alternative Data? Alternative data is information gathered from non-traditional sources that others are not using. Analysis of alternative data can provide insights beyond that which an industry’s regular data sources are ca {. . .}
Anomaly Detection is the technique of identifying rare events or observations which can raise suspicions by being statistically different from the rest of the observations. Such “anomalous” behavior typically translates to some kind of a problem like credit card fraud, a failing machine, or a cy {. . .}
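One simple way to flag observations that are "statistically different from the rest" is a z-score test, sketched below. The z-score approach, the threshold of 2.0, and the sample readings are assumptions for illustration only; production systems often use more robust methods.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 95]
anomalies = find_anomalies(readings)
```

Because the outlier itself inflates the mean and standard deviation, a single extreme value can mask smaller anomalies, which is one reason this is only a starting point.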
What is Apache Hive? Apache Hive is open-source data warehouse software designed to read, write, and manage large dat {. . .}
What is Apache Kudu? Apache Kudu is a free and open source columnar storage system developed for Apache Hadoop. It is an engine intended for structured data that supports low-latency random acces {. . .}
What is Apache Kylin? Apache Kylin is a distributed, open source online analytical processing (OLAP) engine for interactive analytics on big data. Apache Kylin has been designed to provide a SQL interface and multi-dimensional analysis (OLAP) on {. . .}
What Is Apache Spark? Apache Spark is an open source analytics engine used for big data workloads. It can handle both batch and real-time analytics and data processing workloads. Apache Spark started in 2009 as a research project at the University of California, Berkeley. {. . .}
What is Apache Spark as a Service? Apache Spark is an open source cluster computing framework for fast real-time large-scale data processing. Since its inception in 2009 at UC Berkeley’s AMPLab, Spark has seen major growth. It is currently rated as the largest open source communi {. . .}
What is an Artificial Neural Network? An artificial neural network (ANN) is a computing system patterned after the operation of neurons in the human brain. How Do Artificial Neural Networks Work? Artificial Neural Networks can be best viewed as weighted direc {. . .}
What is Automation Bias? Automation bias is an over-reliance on automated aids and decision support systems. As the availability of automated decision aids is increasing, additions to critical decision-making contexts such as intensive care units or aircraft cockpits are beco {. . .}
What Are Bayesian Neural Networks? Bayesian Neural Networks (BNNs) extend standard networks with posterior inference in order to control over-fitting. From a broader perspective, the Bayesian approach uses the statistical methodology so that everything has a probab {. . .}
The Difference Between Data and Big Data Analytics Prior to the invention of Hadoop, the technologies underpinning modern storage and compute systems were relatively basic, limiting compan {. . .}
Bioinformatics is a field of study that uses computation to extract knowledge from large collections of biological data. {. . .}
At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasi quotes) in a novel way to build an extensible query optimizer. Catalyst is based on functional programming constructs in Scala and designed with t {. . .}
What is Complex Event Processing [CEP]? Complex event processing [CEP], also known as event, stream, or event stream processing, is the use of technology for querying data before storing it within a database or, in some cases, without it ever being stored. Complex event processing i {. . .}
A continuous application is an end-to-end application that reacts to data in real time. In particular, developers would like to use a single programming interface to support the facets of continuous applications that are currently handled in separate systems, such as query serving or interaction wit {. . .}
In deep learning, a convolutional neural network (CNN or ConvNet) is a class of deep neural network typically used to recognize patterns in images, though CNNs are also used for spatial data analysis, computer vision, natural language processing, signal processing, and various other p {. . .}
What is a Data Analysis Platform? A data analytics platform is an ecosystem of services and technologies for analyzing voluminous, complex, and dynamic data. It allows you to retrieve, combine, interact with, explore, and visualize data from the various sources a compan {. . .}
What is Data Governance? Data governance is the oversight to ensure data brings value and supports the business strategy. Data governance is more than just a tool or a process. It aligns data-related requirements to the business strategy using a framework across peo {. . .}
What is a Data Lakehouse? A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data wareho {. . .}
What is a data mart? A data mart is a curated database including a set of tables that are designed to serve the specific needs of a single data team, community, or line of business, like the marketing or engineering department. It is normally smaller and more focused than a data war {. . .}
What is data sharing? Data sharing is the ability to make the same data available to one or many consumers. Nowadays, the ever-growing amount of data has become a strategic asset for any company. Sharing data - within your organization or externally - is an enabling technology fo {. . .}
What is a data vault? A data vault is a data modeling design pattern used to build a data warehouse for enterprise-scale analytics. The data vault has three types of entities: hubs, links, and satellites. Hubs represent core business concepts, {. . .}
What is a data warehouse? A data warehouse is a data management system that stores current and historical data from multiple sources in a business friendly manner for easier insights and reporting. Data warehouses are typically used for business intelligence (BI), reporting and d {. . .}
Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics. The primary differentiations a {. . .}
What is a DataFrame? A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of s {. . .}
Datasets are a type-safe version of Spark’s structured API for Java and Scala. This API is not available in Python and R, because those are dynamically typed languages, but it is a powerful tool for writing large applications in Scala and Java. Recall that DataFrames are a distributed {. . .}
What is Deep Learning? Deep learning is a subset of machine learning that applies algorithms inspired by the structure and function of the human brain to large amounts of data, which is why deep learning models are often referred to as deep neural networks. It i {. . .}
What is demand forecasting? Demand forecasting is the process of projecting consumer demand (equating to future revenue). Specifically, it is projecting the assortment of products shoppers will buy using quantitative and qualitative data. {. . .}
Dense tensors store values in a contiguous sequential block of memory where all values are represented. Tensors or multi-dimensional arrays are used in a diverse set of multi-dimensional data analysis applications. There are a number of software products that can perform tensor computations, s {. . .}
What is a Digital Twin? The classical definition of a digital twin is: “A digital twin is a virtual model designed to accurately reflect a physical object.” – IBM {. . .}
What is a DNA Sequence? DNA sequencing is the process of determining the exact sequence of nucleotides of DNA (deoxyribonucleic acid). Sequencing DNA means determining the order in which the four chemical building blocks (adenine, guanine, cytosine, and thymine), also known as bases, occur within the {. . .}
Feature engineering for machine learning Feature engineering, also called data preprocessing, is the process of converting raw data into features that can be used to develop machine learning models. This topic describes the principal concepts of feature engineering and the role it plays {. . .}
Genomics is an area within genetics that concerns the sequencing and analysis of an organism’s genome. Its main task is to determine the entire sequence of DNA or the composition of the atoms that make up the DNA and the chemical bonds between the DNA atoms. The field of genomics is interested {. . .}
What Is a Hadoop Cluster? Apache Hadoop is an open source, Java-based, software framework and parallel data processing engine. It enables big data analytics processing tasks to be broken down into smaller tasks that can be perfor {. . .}
HDFS HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. This open source framework works by rapidly transferring data between nodes. It’s often used by companies who need to handle and store big data. HDFS is a key component of many Hadoop {. . .}
What is the Hadoop Ecosystem? Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library; it includes open source projects as well as a complete range of complementary tools. Some of t {. . .}
In computing, a hash table [hash map] is a data structure that provides virtually direct access to objects based on a key [a unique String or Integer]. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. Here are the {. . .}
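The mechanism described above (a hash function computing an index into an array of buckets) can be sketched as follows. This is a minimal, illustrative version that resolves collisions by chaining; the `HashTable` class name, its method names, and the fixed bucket count are assumptions, and Python's built-in `dict` is what you would use in practice.

```python
class HashTable:
    """A toy hash table using separate chaining for collisions."""

    def __init__(self, num_buckets=8):
        # Each bucket holds a list of (key, value) pairs.
        self._buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The hash function maps a key to a bucket index.
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("spark", 2009)
table.put("hive", "warehouse")
table.put("spark", 2010)  # replaces the earlier value for "spark"
```

Lookups only scan the one bucket the key hashes to, which is what makes access "virtually direct" as long as buckets stay short.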
What is a Hive Date Function? Hive provides many built-in functions to help us in the processing and querying of data. Some of the functionalities provided by these functions include string manipulation, date manipulation, type conversion, conditional operators, mathematical functio {. . .}
What is Hosted Spark? Apache Spark is a fast and general cluster computing system for Big Data built around speed, ease of use, and advanced analytics that was originally built in 2009 at UC Berkeley. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine {. . .}
What is a Jupyter Notebook? A Jupyter Notebook is an open source web application that allows data scientists to create and share documents that include live code, equatio {. . .}
What is a Keras Model? Keras is a high-level library for deep learning, built on top of Theano and TensorFlow. It is written in Python and provides a clean and convenient way to create a range of deep learning models. {. . .}
What is Lakehouse for Retail? Lakehouse for Retail is Databricks’ first industry-specific Lakehouse. It helps retailers get up and running quickly through solution accelerators, data sharing capabilities, and a partner ecosystem. {. . .}
What is Lambda Architecture? Lambda architecture is a way of processing massive quantities of data (i.e. “Big Data”) that provides access to batch-processing and stream-processing methods with a hybrid approach. Lambda architecture is used to solve the problem of computing arbitra {. . .}
Apache Spark’s Machine Learning Library (MLlib) is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, and speed of Spark, data scientists can focus on their data problems and models instead of solving the complexities surround {. . .}
What is a machine learning Model? A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. For example, in natural language processing, machine learning models can parse and correctly recognize the intent behind previously unhe {. . .}
What is Managed Spark? A managed Spark service lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. By using such automation you will be able to quickly create clusters on demand, manage them with ease and turn them {. . .}
What is MapReduce? MapReduce is a Java-based, distributed execution framework within the Apache Hadoop Ecosystem. It takes away the complexity of distributed programming by exposing two processing steps that developers implem {. . .}
What is a medallion architecture? A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of da {. . .}
Running machine learning algorithms typically involves a sequence of tasks including pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a class {. . .}
What is MLOps? MLOps stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them. MLOps is a collaborative function, often com {. . .}
Model risk management refers to the supervision of risks from the potential adverse consequences of decisions based on incorrect or misused models. The aim of model risk management is to employ techniques and practices that will identify, measure, and mitigate model risks, i.e. the possibility of mode {. . .}
What is a Neural Network? A neural network is a computing model whose layered structure resembles the networked structure of neurons in the brain. It features interconnected processing elements called neurons that work together to produce an output function. Neural networks are made of {. . .}
What is Open Banking? Open banking is a secure way to provide access to consumers’ financial data, all contingent on customer consent.² Driven by regulatory, technology, and competitive dynamics, Open Banking calls for the democratization of customer data to non-bank third partie {. . .}
What is Orchestration? Orchestration is the coordination and management of multiple computer systems, applications and/or services, stringing together multiple tasks in order to execute a larger workflow or process. These processes can consist of multiple tasks that are automated and ca {. . .}
What is Overall Equipment Effectiveness? Overall Equipment Effectiveness (OEE) is a measure of how well a manufacturing operation is utilized (facilities, time and material) compared to {. . .}
When it comes to data science, it’s no exaggeration to say that you can transform the way your business works by using it to its full potential with the pandas DataFrame. To do that, you’ll need the right data stru {. . .}
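The excerpt above does not give a formula, so the sketch below assumes the commonly used definition of OEE as the product of availability, performance, and quality, each expressed as a fraction between 0 and 1. The function name and the sample figures are hypothetical.

```python
def overall_equipment_effectiveness(availability, performance, quality):
    """Compute OEE as availability x performance x quality (assumed definition)."""
    for name, value in (
        ("availability", availability),
        ("performance", performance),
        ("quality", quality),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return availability * performance * quality

# Hypothetical line: up 90% of planned time, at 95% of ideal speed,
# with 99% of units passing inspection.
oee = overall_equipment_effectiveness(0.90, 0.95, 0.99)
```

Because the three factors multiply, a modest shortfall in each one compounds into a noticeably lower overall score.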
What is Parquet? Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed {. . .}
What is Personalized Finance? Financial products and services are becoming increasingly commoditized and consumers are becoming more discerning as the media and retail industries have increased their penchant for personalized experiences. To remain competitive, banks have to offer a {. . .}
What is Predictive Analytics? Predictive analytics is a form of advanced analytics that uses both new and historical data to determine patterns and predict future outcomes and trends. How Does Predictive Analytics Work? Predictive analytics uses many techniques suc {. . .}
What is predictive maintenance? Predictive Maintenance, in a nutshell, is all about figuring out when an asset should be maintained, and what specific maintenance activities need to be performed, based on an asset’s actual condition or state, rather than on a fixed schedule, so that y {. . .}
PyCharm is an integrated development environment (IDE) used in computer programming, created for the Python programming language. When using PyCharm on Databricks, by default PyCharm creates a Python Virtual Environment, but you can configure it to create a Conda environment or use an existing one. {. . .}
What is PySpark? Apache Spark is written in the Scala programming language. PySpark was released to support the collaboration of Apache Spark and Python; it is a Python API for Spark. In addition, PySpark helps you interface with Resilient Distributed Datasets (R {. . .}
What is real-time data for Retail? Real-time retail is real-time access to data. Moving from batch-oriented access, analysis and compute will allow data to be “always on,” therefore driving accurate, timely decisions and business intelligence. Real-tim {. . .}
RDD has been the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed {. . .}
What is a snowflake schema? A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into subdimensions. Snowflake schemas are commonly used for bus {. . .}
If you are working with Spark, you will come across the three APIs: DataFrames, Datasets, and RDDs. What are Resilient Distributed Datasets? RDDs, or Resilient Distributed Datasets, are collections of records with distributed computing, which are fault tolerant, immutable in natur {. . .}
Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or {. . .}
What is Spark Elasticsearch? Spark Elasticsearch is a NoSQL, distributed database that stores, retrieves, and manages document-oriented and semi-structured data. It is a GitHub open source, RESTful search engine built on top of Apache Lucene and released under the terms of the Apache Lic {. . .}
Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can als {. . .}
Apache Spark Streaming is the previous generation of Apache Spark’s streaming engine. Spark Streaming no longer receives updates and is a legacy project. There is a newer and easier to use streaming engine in Apache Spark called Structured Streaming. You should use Spark Structured Streami {. . .}
What is Spark Performance Tuning? Spark Performance Tuning refers to the process of adjusting settings for the memory, cores, and instances used by the system. This process helps ensure that Spark performs well and also prevents bottlenecking of resources in S {. . .}
What is Sparklyr? Sparklyr is an open-source package that provides an interface between R and Apache Spark. You can now leverage Spark’s capabilities in a modern R environment, due to Spark’s ability to interact with distributed data with little latency. Sparklyr is an effecti {. . .}
SparkR is a tool for running R on Spark. It follows the same principles as all of Spark’s other language bindings. To use SparkR, we simply import it into our environment and run our code. It’s all very similar to the Python API except that it follows R’s syntax instead of Python’s. For the most {. . .}
Python’s NumPy library, a third-party package rather than part of the standard library, is used to manipulate multi-dimensional arrays. The organization and use of this library is a primary requirement for developing the pytensor library. {. . .}
What is a star schema? A star schema is a multi-dimensional data model used to organize data in a database so that it is easy to understand and analyze. Star schemas can be applied to data warehouses, databases, data marts, and other tools. The star schema design is optimized for {. . .}
How Does Stream Analytics Work? Streaming analytics, also known as event stream processing, is the analysis of huge pools of current and “in-motion” data through the use of continuous queries, called event streams. These streams are triggered by a specific event that happens {. . .}
Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs, and run them in a streaming fashion. This can reduce latency and allow {. . .}
In November of 2015, Google released its open-source framework for machine learning and named it TensorFlow. It supports deep-learning, neural networks, and general numerical computations on CPUs, GPUs, and clusters {. . .}
What is the TensorFlow Estimator API? Estimators represent a complete model but also look intuitive enough for less experienced users. The Estimator API provides methods to train the model, to judge the model’s accuracy, and to generate predictions. {. . .}
What Are Transformations? In Spark, the core data structures are immutable, meaning they cannot be changed once created. This might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? In order to “change” a DataFrame you will {. . .}
What is the Tungsten Project? Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern {. . .}
Unified Artificial Intelligence or UAI was announced by Facebook during F8 this year. This brings together 2 specific deep learning frameworks that Facebook created and open-sourced - PyTorch focused on research assuming access to large-scale compute resources while Caffe focused on model deployment o {. . .}
Unified Data Analytics is a new category of solutions that unify data processing with AI technologies, making AI much more achievable for enterprise organizations and enabling them to accelerate their AI initiatives. Unified Data Analytics makes it easier for enterprises to build data pipelines acro {. . .}
Databricks' Unified Data Analytics Platform helps organizations accelerate innovation by unifying data science with engineering and business. With Databricks as your Unified Data Analytics Platform, you can quickly prepare and clean data at massive scale with no limitations. The pl {. . .}
What is a Unified Data Warehouse? A unified database, also known as an enterprise data warehouse, holds all the business information of an organization and makes it accessible all across the company. Most companies today have their data managed in isolated silos while different {. . .}
Apache Hadoop is an open source, Java-based software platform that manages data processing and storage for big data applications. The platform works by distributing Hadoop big data and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in pa {. . .}