Skip to main content



What is a transaction? In the context of databases and data storage systems, a transaction is any operation that is treated as a single unit of work, which either completes fully or does not complete at all, and leaves the storage system in a consist{...}
Gradient descent is the most commonly used optimization method deployed in machine learning and deep learning algorithms. It’s used to train a machine learning model. Types of Gradient Descent There are three primary types of gradient descent used in{...}
What is Alternative Data? Alternative data is information gathered by using alternative sources of data that others are not using;  non-traditional information sources. Analysis of alternative data can provide insights beyond that which an indus{...}
Anomaly Detection is the technique of identifying rare events or observations which can raise suspicions by being statistically different from the rest of the observations. Such “anomalous” behavior typically translates to some kind of a problem like{...}
What is Apache Hive? Apache Hive is open-source data warehouse software designed to read, write, and manage large datasets extracted from the Apache Hadoop Distributed File System (HDFS) , one aspect of a larger Hadoop Ecosystem. With extensive Apach{...}
What is Apache Kudu? Apache Kudu is a free and open source columnar storage system developed for the Apache Hadoop. It is an engine intended for structured data that supports low-latency random access millisecond-scale access to individual rows toget{...}
What is Apache Kylin? Apache Kylin is a distributed open source online analytics processing (OLAP) engine for interactive analytics Big Data. Apache Kylin has been designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spar{...}
What Is Apache Spark? Apache Spark is an open source analytics engine used for big data workloads. It can handle both batches as well as real-time analytics and data processing workloads. Apache Spark started in 2009 as a research project at the{...}
What is Apache Spark as a Service? Apache Spark is an open source cluster computing framework for fast real-time large-scale data processing. Since its inception in 2009 at UC Berkeley’s AMPLab, Spark has seen major growth. It is currently rated{...}
What is an Artificial Neural Network? An artificial neuron network (ANN) is a computing system patterned after the operation of neurons in the human brain. How Do Artificial Neural Networks Work? Artificial Neural Networks can be best viewed as weigh{...}
What is Automation Bias? Automation bias is an over-reliance on automated aids and decision support systems. As the availability of automated decision aids is increasing additions to critical decision-making contexts such as intensive care units, or {...}
What Are Bayesian Neural Networks? Bayesian Neural Networks (BNNs) refers to extending standard networks with posterior inference in order to control over-fitting. From a broader perspective, the Bayesian approach uses the statistical methodology so {...}
The Difference Between Data and Big Data Analytics Prior to the invention of Hadoop, the technologies underpinning modern storage and compute systems were relatively basic, limiting companies mostly to the analysis of "small data." Even this relative{...}
Bioinformatics is a field of study that uses computation to extract knowledge from large collections of biological data. Bioinformatics refers to the use of IT in biotechnology for storing, retrieving, organizing and analyzing biological data. An out{...}
At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasi quotes) in a novel way to build an extensible query optimizer. Catalyst is based on functional program{...}
What is Complex Event Processing [CEP]? Complex event processing [CEP] also known as event, stream or event stream processing is the use of technology for querying data before storing it within a database or, in some cases, without it ever being stor{...}
Continuous applications are an end-to-end application that reacts to data in real-time. In particular, developers would like to use a single programming interface to support the facets of continuous applications that are currently handled in separate{...}
In deep learning, a convolutional neural network (CNN or ConvNet) is a class of deep neural networks, that are typically used to recognize patterns present in images but they are also used for spatial data analysis, computer vision, natural language {...}
What is a Data Analysis Platform? A data analytics platform is an ecosystem of services and technologies that needs to perform analysis on voluminous, complex and dynamic data that allows you to retrieve, combine, interact with, explore, and visuali{...}
As the amount of data, data sources and data types grow, organizations increasingly require tools and strategies to help them transform that data and derive business insights. Processing raw, messy data into clean, quality data is a critical step bef{...}
What is Data Governance? Data governance is the oversight to ensure data brings value and supports the business strategy. Data governance is more than just a tool or a process. It aligns data-related requirements to the business strategy using a fram{...}
What is a Data Lakehouse? A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intell{...}
What is a data marketplace or data market? Data marketplaces, or data markets, are online stores that enable data sharing and collaboration. They connect data providers and data consumers, offering participants the opportunity to buy and sell data an{...}
What is a data mart? A data mart is a curated database including a set of tables that are designed to serve the specific needs of a single data team, community, or line of business, like the marketing or engineering department. It is normally smaller{...}
If you work in a role that interacts with data, you'll have come across a data pipeline, whether you realize it or not. Many modern organizations use a variety of cloud-based platforms and technologies to run their operations, and data pipelines are {...}
In today’s highly connected world, cybersecurity threats and insider risks are a constant concern. Organizations need to have visibility into the types of data they have, prevent the unauthorized use of data, and identify and mitigate risks around th{...}
What is data sharing? Data sharing is the ability to make the same data available to one or many consumers. The ever-growing amount of data has become a strategic asset for any company. Sharing data — within business units as well as consuming data f{...}
What Is Data Transformation? Data transformation is the process of taking raw data that has been extracted from data sources and turning it into usable datasets. Data pipelines often include multiple data transformations, changing messy information i{...}
What is a data vault? A data vault is a data modeling design pattern used to build a data warehouse for enterprise-scale analytics. The data vault has three types of entities: hubs, links, and satellites. Hubs represent core business concepts, links {...}
What is a data warehouse? A data warehouse is a data management system that stores current and historical data from multiple sources in a business friendly manner for easier insights and reporting. Data warehouses are typically used for business inte{...}
Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of{...}
What is a DataFrame? A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a f{...}
What is a Dataset? A dataset is a structured collection of data organized and stored together for analysis or processing. The data within a dataset is typically related in some way and taken from a single source or intended for a single project. For {...}
What is Deep Learning? Deep Learning is a subset of machine learning concerned with large amounts of data with algorithms that have been inspired by the structure and function of the human brain, which is why deep learning models are often referred t{...}
What is demand forecasting? Demand forecasting is the process of projecting consumer demand (equating to future revenue). Specifically, it is projecting the assortment of products shoppers will buy using quantitative and qualitative data. Retailers a{...}
Dense tensors store values in a contiguous sequential block of memory where all values are represented. Tensors or multi-dimensional arrays are used in a diverse set of multi-dimensional data analysis applications. There are a number of software prod{...}
What is a Digital Twin? The classical definition of of digital twin is; ""A digital twin is a virtual model designed to accurately reflect a physical object."" – IBM[KVK4] For a discrete or continuous manufacturing process, a digital twin gathers sys{...}
What is a DNA Sequence? The DNA sequence is the process of determining the exact sequence of nucleotides of DNA (deoxyribonucleic acid).  Sequencing DNA the order of the four chemical building blocks - adenine, guanine, cytosine, and thymine als{...}
What is ETL? As the amount of data, data sources, and data types at organizations grow, the importance of making use of that data in analytics, data science and machine learning initiatives to derive business insights grows as well. The need to prior{...}
Feature engineering for machine learning Feature engineering, also called data preprocessing, is the process of converting raw data into features that can be used to develop machine learning models. This topic describes the principal concepts of feat{...}
Generative AI is changing the way humans create, work and communicate. Databricks explains how generative AI works and where it’s heading next. {...}
Genomics is an area within genetics that concerns the sequencing and analysis of an organism's genome. Its main task is to determine the entire sequence of DNA or the composition of the atoms that make up the DNA and the chemical bonds between the DN{...}
What Is a Hadoop Cluster? Apache Hadoop is an open source, Java-based, software framework and parallel data processing engine. It enables big data analytics processing tasks to be broken down into smaller tasks that can be performed in parallel by us{...}
HDFS HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. This open source framework works by rapidly transferring data between nodes. It's often used by companies who need to handle and store big data. HDF{...}
What is the Hadoop Ecosystem? Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library; it includes open source projects as well as a complete range of complementary tools. Some of the most well-known tools of th{...}
In computing, a hash table [hash map] is a data structure that provides virtually direct access to objects based on a key [a unique String or Integer]. A hash table uses a hash function to compute an index into an array of buckets or slots, from whic{...}
What is a Hive Date Function? Hive provides many built-in functions to help us in the processing and querying of data. Some of the functionalities provided by these functions include string manipulation, date manipulation, type conversion, conditiona{...}
What is Hosted Spark? Apache Spark is a fast and general cluster computing system for Big Data built around speed, ease of use, and advanced analytics that was originally built in 2009 at UC Berkeley. It provides high-level APIs in Scala, Java, Pytho{...}
What is a Jupyter Notebook? A Jupyter Notebook is an open source web application that allows data scientists to create and share documents that include live code, equations, and other multimedia resources. What are Jupyter Notebooks used for? Jupyter{...}
What is a Keras Model? Keras is a high-level library for deep learning, built on top of Theano and Tensorflow. It is written in Python and provides a clean and convenient way to create a range of deep learning models. Keras has become one of the{...}
What is Lakehouse for Retail? Lakehouse for Retail is Databricks’ first industry-specific Lakehouse. It helps retailers get up and running quickly through solution accelerators, data sharing capabilities, and a partner ecosystem. Lakehouse for Retail{...}
What is Lambda Architecture? Lambda architecture is a way of processing massive quantities of data (i.e. "Big Data") that provides access to batch-processing and stream-processing methods with a hybrid approach. Lambda architecture is used to solve t{...}
What are Large Language Models (LLMs)? Large language models (LLMs) are a new class of natural language processing (NLP) models that have significantly surpassed their predecessors in performance and ability in a variety of tasks such as answering op{...}
What Is LLMOps? Large Language Model Ops (LLMOps) encompasses the practices, techniques and tools used for the operational management of large language models in production environments. The latest advances in LLMs, underscored by releases such as Op{...}
Apache Spark’s Machine Learning Library (MLlib) is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, and speed of Spark, data scientists can focus on their data problems and mod{...}
What is a machine learning Model? A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. For example, in natural language processing, machine learning models can parse and correctly recognize {...}
What is Managed Spark? A managed Spark service lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. By using such an automation you will be able to quickly create clusters on -demand, mana{...}
What is MapReduce? MapReduce is a Java-based, distributed execution framework within the Apache Hadoop Ecosystem.  It takes away the complexity of distributed programming by exposing two processing steps that developers implement: 1) Map and 2) {...}
What is a materialized view? A materialized view is a database object that stores the results of a query as a physical table. Unlike regular database views, which are virtual and derive their data from the underlying tables, materialized views contai{...}
 What is a medallion architecture? A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through{...}
Typically when running machine learning algorithms, it involves a sequence of tasks including pre-processing, feature extraction, model fitting, and validation stages. For example, when classifying text documents might involve text segmentation and c{...}
What is MLOps? MLOps stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them. MLOp{...}
Model risk management refers to the supervision of risks from the potential adverse consequences of decisions based on incorrect or misused models. The aim of model risk management is to employ techniques and practices that will identify, measure and{...}
Multi-Statement Transactions for Databricks Delta Tables Databricks has support for multi-statement transactions if the underlying tables are Databricks Delta tables.  This means that all of the statements within the transaction will be atomic ({...}
What is a Neural Network? A neural network is a computing model whose layered structure resembles the networked structure of neurons in the brain. It features interconnected processing elements called neurons that work together to produce an output f{...}
What is Open Banking? Open banking is a secure way to provide access to consumers' financial data, all contingent on customer consent.² Driven by regulatory, technology, and competitive dynamics, Open Banking calls for the democratization of customer{...}
What is Orchestration? Orchestration is the coordination and management of multiple computer systems, applications and/or services, stringing together multiple tasks in order to execute a larger workflow or process. These processes can consist of mul{...}
What is Overall Equipment Effectiveness? Overall Equipment Effectiveness(OEE) is a measure of how well a manufacturing operation is utilized (facilities, time and material) compared to its full potential, during the periods when it is scheduled to ru{...}
When it comes to data science, it's no exaggeration to say that you can transform the way your business works by using it to its full potential with pandas DataFrame. To do that, you'll need the right data structures. These will help you be as effic{...}
What is Parquet? Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bul{...}
What is Personalized Finance? Financial products and services are becoming increasingly commoditized and consumers are becoming more discerning as the media and retail industries have increased their penchant for personalized experiences. To remain c{...}
What is Predictive Analytics? Predictive analytics is a form of advanced analytics that uses both new and historical data to determine patterns and predict future outcomes and trends. How Does Predictive Analytics Work? Predictive analytics uses many{...}
What is predictive maintenance? Predictive Maintenance, in a nutshell, is all about figuring out when an asset should be maintained, and what specific maintenance activities need to be performed, based on an asset’s actual condition or state, rather {...}
PyCharm is an integrated development environment (IDE) used in computer programming, created for the Python programming language. When using PyCharm on Databricks, by default PyCharm creates a Python Virtual Environment, but you can configure to cre{...}
What is PySpark? Apache Spark is written in Scala programming language. PySpark has been released in order to support the collaboration of Apache Spark and Python, it actually is a Python API for Spark. In addition, PySpark, helps you interface with {...}
What Is Real-Time Analytics? Real-time analytics refers to the practice of collecting and analyzing streaming data as it is generated, with minimal latency between the generation of data and the analysis of that data. Real-time analytics is often use{...}
What is real-time data for Retail? Real-time retail is real-time access to data. Moving from batch-oriented access, analysis and compute will allow data to be “always on,” therefore driving accurate, timely decisions and business intelligence. Real-t{...}
RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster that can be operated in parallel with a low-level API tha{...}
What Is Retrieval Augmented Generation, or RAG? Retrieval augmented generation, or RAG, is an architectural approach that can improve the efficacy of large language model (LLM) applications by leveraging custom data. This is done by retrieving data/d{...}
 What is a snowflake schema? A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into subdimensions. Snowflake schemas are commonly used for business intelligence and{...}
If you are working with Spark, you will come across the three APIs: DataFrames, Datasets, and RDDs What are Resilient Distributed Datasets? RDD or Resilient Distributed Datasets, is a collection of records with distributed computing, which are fault {...}
Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; res{...}
What is Spark Elasticsearch? Spark Elasticsearch is a NoSQL, distributed database that stores, retrieves, and manages document-oriented and semi-structured data. It is a GitHub open source, RESTful search engine built on top of Apache Lucene and rele{...}
Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can al{...}
Apache Spark Streaming is the previous generation of Apache Spark’s streaming engine. There are no longer updates to Spark Streaming and it’s a legacy project. There is a newer and easier to use streaming engine in Apache Spark called Structured Stre{...}
What is Spark Performance Tuning? Spark Performance Tuning refers to the process of adjusting settings to record for memory, cores, and instances used by the system. This process guarantees that the Spark has a flawless performance and also prevents {...}
What is Sparklyr? Sparklyr is an open-source package that provides an interface between R and Apache Spark. You can now leverage Spark’s capabilities in a modern R environment, due to Spark’s ability to interact with distributed data with little late{...}
SparkR is a tool for running R on Spark. It follows the same principles as all of Spark’s other language bindings. To use SparkR, we simply import it into our environment and run our code. It’s all very similar to the Python API except that it follow{...}
Python offers an inbuilt library called numpy to manipulate multi-dimensional arrays. The organization and use of this library is a primary requirement for developing the pytensor library. Sptensor is a class that represents the sparse tensor. A spar{...}
What is a star schema? A star schema is a multi-dimensional data model used to organize data in a database so that it is easy to understand and analyze. Star schemas can be applied to data warehouses, databases, data marts, and other tools. The star {...}
How Does Stream Analytics Work? Streaming analytics, also known as event stream processing, is the analysis of huge pools of current and “in-motion” data through the use of continuous queries, called event streams. These streams are triggered by a s{...}
Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs, and run them in a stre{...}
What is supply chain management? Supply chain management is the process of planning, implementing and controlling operations of the supply chain with the goal of efficiently and effectively producing and delivering products and services to the end cu{...}
In November of 2015, Google released its open-source framework for machine learning and named it TensorFlow. It supports deep-learning, neural networks, and general numerical computations on CPUs, GPUs, and clusters of GPUs. One of the biggest advan{...}
What is the Tensorflow Estimator API? Estimators represent a complete model but also look intuitive enough to less user. The Estimator API provides methods to train the model, to judge the model’s accuracy, and to generate predictions. TensorFlow pro{...}
What is the Tungsten Project? Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance cl{...}
Unified Artificial Intelligence or UAI was announced by Facebook during F8 this year. This brings together 2 specific deep learning frameworks that Facebook created and outsourced - PyTorch focused on research assuming access to large-scale compute r{...}
Unified Data Analytics is a new category of solutions that unify data processing with AI technologies, making AI much more achievable for enterprise organizations and enabling them to accelerate their AI initiatives. Unified Data Analytics makes it e{...}
Databricks' Unified Data Analytics Platform helps organizations accelerate innovation by unifying data science with engineering and business. With Databricks as your Unified Data Analytics Platform, you can quickly prepare and clean data at mass{...}
What is a Unified Data Warehouse? A unified database also known as an enterprise data warehouse holds all the business information of an organization and makes it accessible all across the company. Most companies today, have their data managed in iso{...}
Apache Hadoop is an open source, Java-based software platform that manages data processing and storage for big data applications. The platform works by distributing Hadoop big data and analytics jobs across nodes in a computing cluster, breaking them{...}