75 Big Data Terms You Should Know

Part 1 (25 terms)

If you’re new to big data, the field can seem overwhelming. Here’s a list of 25 big data terms to get you started.

Algorithm: An algorithm can be understood as a mathematical formula or a statistical procedure used for data analysis. So how is "algorithm" related to big data? Although the word is a general term, it comes up constantly in this era of ubiquitous big data analysis, because analysis at scale is driven by algorithms.

Analytics: Imagine a very likely scenario: your credit card company sends you an email summarizing your card transactions for the whole year. What if you took that list and worked out the percentage you spent on food, clothing, entertainment, and so on? You would be doing analytics — mining useful information from your raw data (information that can help you decide how to spend next year). Now what if you processed the posts of an entire city's worth of people on Twitter and Facebook in a similar way? That would be big data analytics. Big data analytics, in short, means reasoning over large amounts of data and extracting useful information from it. There are three different types of analytics, so let's sort them out next.

Descriptive Analytics: If you can tell me only that last year your credit card spending was 25% on food, 35% on clothing, 20% on entertainment, and the remaining 20% on miscellaneous expenses, that is descriptive analytics. You can, of course, drill down into far more detail.
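
To make this concrete, here is a minimal sketch (using pandas and invented spending figures, not data from any real card statement) of the kind of percentage breakdown a descriptive analysis produces:

```python
import pandas as pd

# Hypothetical yearly credit card spending by category
spending = pd.Series({"food": 2500, "clothing": 3500,
                      "entertainment": 2000, "misc": 2000})

# Descriptive analytics: summarize what already happened
percentages = spending / spending.sum() * 100
print(percentages.round(1))   # food 25.0, clothing 35.0, ...
```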

Predictive Analytics: If you analyzed your credit card history for the past five years and found that spending followed a consistent trend from year to year, you could predict, with high probability, that next year's spending will look similar. This is not fortune-telling; it is better understood as "predicting with probabilities" what may happen. In big data predictive analytics, data scientists may use advanced techniques such as machine learning and sophisticated statistical methods (which we will talk about later) to forecast the weather, economic changes, and so on.
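
As an illustration only (the numbers are invented, and NumPy's least-squares fit stands in for the advanced techniques mentioned above), a simple trend extrapolation looks like this:

```python
import numpy as np

years = np.array([2019, 2020, 2021, 2022, 2023])
spend = np.array([9500, 9800, 10100, 10500, 10800])  # hypothetical yearly totals

# Fit a straight line to past spending and extrapolate one year ahead
slope, intercept = np.polyfit(years, spend, 1)
predicted_2024 = slope * 2024 + intercept
print(f"Likely 2024 spending: about {predicted_2024:.0f}")
```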

Prescriptive Analytics: Sticking with the credit card example: suppose you want to find out which category of your spending (food, entertainment, clothing, and so on) has the biggest impact on your overall expenses. Prescriptive analytics builds on predictive analytics by introducing "actions" (for example, cutting back on food, clothing, or entertainment) and analyzing the resulting outcomes, in order to prescribe the best category to cut to reduce your total spend. Extend this to big data and imagine how an executive makes so-called "data-driven" decisions by watching the impact of multiple such actions in front of them.

Batch processing: Although batch processing has been around since the mainframe era, it has taken on extra significance in the big data era, where large volumes of data must be processed. Batch processing is an efficient way to work through a large pile of data, such as a batch of transactions collected over a period of time. Distributed computing with Hadoop, discussed later, is a specialized approach to batch processing.

Cassandra: Cassandra is a popular open source distributed database management system maintained by the Apache Software Foundation. Apache stewards many big data technologies, and Cassandra was designed specifically to handle large amounts of data spread across distributed servers.

Cloud computing: The term is practically a household word by now, so it hardly needs elaboration, but it is included here for completeness. Essentially, when software or data is processed on remote servers and those resources can be accessed from anywhere over the Internet, it can be called cloud computing.

Cluster computing: This is a figurative term to describe computing using the rich resources of a cluster of multiple servers. More technically, in the context of cluster processing, we may discuss nodes, cluster management layer, load balancing, parallel processing, etc.

Dark data: This is a coined term that, in my opinion, exists to scare people and make senior management sound mysterious. Basically, dark data is all the data a company collects and processes but never actually uses — hence "dark": it may never be analyzed at all. It can be social network feeds, call center logs, meeting minutes, and so on. Many estimates put dark data at 60% to 90% of all corporate data, but nobody really knows.

Data lake: When I first heard this term, I honestly thought it was an April Fools' joke, but it is a real term. A data lake is a repository that holds enterprise-wide data in its raw, original format. While we are at it, a data warehouse is a related concept, but it stores structured data that has been cleaned and integrated with other sources; data warehouses are typically used for well-defined, general-purpose data (though not always). A data lake is generally thought to make it easier to get at the data you really need, and to process and use it effectively.

Data mining: Data mining is about the process of finding meaningful patterns and insights from a large set of data using sophisticated pattern recognition techniques. It is closely related to the "analysis" mentioned above. In data mining, you will first mine the data and then analyze the results. To get meaningful patterns, data miners use statistics (a classic old method), machine learning algorithms, and artificial intelligence.

Data Scientist: Data scientist is a very sexy profession these days. It refers to people who can take raw data (drawn from what we just called a data lake), then understand it, process it, and extract insights from it. The skill set expected of data scientists borders on the superhuman: analytical ability, statistics, computer science, creativity, storytelling, and an understanding of business context. No wonder these people are paid so well.

Distributed File System: Big data is too large to be stored in a single system. A distributed file system is a file system that can store large amounts of data on multiple storage devices, which can reduce the cost and complexity of storing large amounts of data.

ETL: ETL stands for Extract, Transform, and Load. It refers to the process of extracting raw data, converting it into a form that is “fit for use” by cleaning/enriching it, and loading it into a suitable repository for the system to use. Even though ETL originated from data warehouses, this process is also used when acquiring data, for example, from external sources in big data systems.
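
A toy ETL pass might look like the following sketch. The file name, column names, and the SQLite target are assumptions for illustration, not part of any particular system:

```python
import csv
import sqlite3

# Extract: read raw records from a CSV export (path is hypothetical)
with open("raw_transactions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean/enrich into a "fit for use" form --
# drop records with no amount, normalize names, cast types
cleaned = [
    {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
    for r in rows
    if r.get("amount")
]

# Load: write the cleaned records into a repository (here, SQLite)
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS transactions (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (:customer, :amount)", cleaned)
conn.commit()
conn.close()
```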

Hadoop: When people think of big data, they immediately think of Hadoop. Hadoop is an open source software framework (with a cute elephant logo) built around the Hadoop Distributed File System (HDFS), which allows big data to be stored, abstracted, and analyzed across clusters of distributed hardware. If you really want to impress someone, tell them about YARN (Yet Another Resource Negotiator), which, as the name suggests, is yet another resource scheduler. I am genuinely impressed by the people who come up with these names. The Apache Foundation, which gave us Hadoop, is also responsible for Pig, Hive, and Spark (yes, these are all software names). Aren't these names something?

In-memory computing: It is generally true that any computation that avoids I/O access will be faster. In-memory computing is a technique that moves the entire working data set into the collective memory of a cluster, avoiding writes of intermediate results to disk during computation. Apache Spark is an in-memory computing system, which gives it a big advantage over I/O-bound systems such as MapReduce.

Internet of Things (IoT): The latest buzzword is the Internet of Things (IoT). IoT is the interconnection of computing devices in embedded objects (such as sensors, wearable devices, cars, refrigerators, etc.) through the Internet, which can send and receive data. IoT generates a huge amount of data, bringing many opportunities for big data analysis.

Machine Learning: Machine learning is a method of designing systems that learn, adjust, and improve based on the data fed to them. Using predefined predictive and statistical algorithms, they steadily approximate the "correct" behavior and insights, and they keep improving as more data flows into the system.
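
A minimal sketch of that "learn from data" loop, assuming scikit-learn is available and using a made-up, spam-style toy data set:

```python
from sklearn.linear_model import LogisticRegression

# Toy feature vectors: [number of links, number of exclamation marks]
X = [[0, 0], [1, 0], [5, 7], [8, 3], [0, 1], [7, 9]]
y = [0, 0, 1, 1, 0, 1]           # 0 = normal message, 1 = spam

model = LogisticRegression()
model.fit(X, y)                  # the "learning" step: fit parameters to data

print(model.predict([[6, 5]]))   # the fitted model generalizes to unseen input
```

Feeding the model more labeled examples and refitting is what "improving as more data is fed into the system" amounts to in practice.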

MapReduce: MapReduce can be a little tricky to grasp, so let me try to explain. MapReduce is a programming model, and it is best understood by noting that Map and Reduce are two separate stages. First, the model splits the large data set into small pieces (these pieces are technically called "tuples", but I will try to avoid obscure jargon), and distributes them to different computers in different locations (that is, the cluster described earlier) — this is the Map stage. The model then collects the partial results and "reduces" them into a single answer. MapReduce's data processing model goes hand in hand with the Hadoop distributed file system.
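
The classic illustration is word counting. The sketch below is plain Python rather than Hadoop code, but it shows the two stages just described: map turns each chunk of input into small (key, value) pairs, and reduce merges them back into one result:

```python
from collections import defaultdict

documents = ["big data is big", "data wants to be processed"]

# Map stage: each chunk of input is turned into (word, 1) pairs
# (in Hadoop, these chunks would live on different nodes of the cluster)
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle + Reduce stage: values for the same key are merged into one result
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # {'big': 2, 'data': 2, 'is': 1, ...}
```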

Non-relational database (NoSQL): The term sounds like the opposite of "SQL, Structured Query Language", which traditional relational database management systems (RDBMS) require, but NoSQL is usually read as "Not Only SQL". NoSQL really refers to database management systems designed to handle large volumes of data that have no fixed structure (or "schema"). NoSQL suits big data systems because large-scale unstructured databases need the flexibility and distributed-first design that NoSQL provides.

R: Could anyone have come up with a worse name for a programming language? Yet R is a language that works very well for statistical work. If you don't know R, don't call yourself a data scientist, because R is one of the most popular programming languages in data science.

Spark (Apache Spark): Apache Spark is a fast in-memory data processing engine that can efficiently run stream processing, machine learning, and SQL workloads that need fast, iterative access to data sets. Spark is usually much faster than the MapReduce we discussed earlier.
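
A minimal PySpark sketch of the same word count shown under MapReduce, assuming a local Spark installation (for example via `pip install pyspark`); intermediate results stay in cluster memory instead of being written to disk between steps:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# A tiny in-memory data set standing in for a large distributed one
lines = spark.sparkContext.parallelize(
    ["big data is big", "data wants to be processed"]
)

counts = (lines.flatMap(lambda line: line.split())   # map stage
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))     # reduce stage

print(counts.collect())
spark.stop()
```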

Stream processing: Stream processing is designed to process streaming data continuously. Combined with stream analysis technology (the ability to continuously calculate numerical and statistical analysis), stream processing methods are particularly capable of real-time processing of large-scale data.
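
The core idea — compute statistics continuously as records arrive, rather than after everything has been collected — can be sketched without any framework (the sensor readings below are invented):

```python
def running_average(stream):
    """Update an average incrementally as each event arrives."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count      # a fresh result is available after every event

# Simulated stream of sensor readings arriving one at a time
readings = iter([21.0, 22.5, 23.0, 21.5])
for avg in running_average(readings):
    print(f"average so far: {avg:.2f}")
```

Real stream processing engines apply the same pattern to unbounded, high-volume streams across many machines.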

Structured vs Unstructured Data: This is one of the contrasts in big data. Structured data is basically any data that can be placed in a relational database, organized in such a way that it can be associated with other data through tables. Unstructured data refers to any data that cannot be placed in a relational database, such as email messages, social media status, human speech, etc.

Part 2 (50 terms)

This article is a continuation of the previous one. Thanks to the overwhelming response to that article, I decided to cover 50 more related terms. As a brief recap, the previous article covered: algorithm, analytics, descriptive analytics, predictive analytics, prescriptive analytics, batch processing, Cassandra (a large-scale distributed data storage system), cloud computing, cluster computing, dark data, data lake, data mining, data scientist, distributed file system, ETL, Hadoop (a platform for developing and running large-scale data processing), in-memory computing, Internet of Things, machine learning, MapReduce (one of Hadoop's core components), NoSQL (non-relational databases), R, Spark (a computing engine), stream processing, and structured vs unstructured data.

Let’s move on to learn about 50 more big data terms.

The Apache Software Foundation (ASF) provides many open source projects for big data, currently more than 350. Explaining all of them would take a long time, so I will just pick out some popular terms.

Apache Kafka: Named after the Czech writer Franz Kafka, it is used to build real-time data pipelines and streaming applications. The reason why it is so popular is that it can store, manage and process data streams in a fault-tolerant way, and it is said to be very "fast". Given that the social network environment involves a lot of data stream processing, Kafka is currently very popular.

Apache Mahout: Mahout provides a library of pre-made algorithms for machine learning and data mining, and can also be used as an environment for creating more algorithms. In other words, the best environment for machine learning geeks.

Apache Oozie: In any programming environment, you need a workflow system to schedule and run jobs in a predefined order with defined dependencies. Oozie provides exactly that for big data jobs written with tools such as Pig, MapReduce, and Hive.

Apache Drill, Apache Impala, Apache Spark SQL: All three of these open source projects offer fast, interactive, SQL-like interaction with data in Apache Hadoop. They are very useful if you already know SQL and work with data stored in big data formats (i.e. HBase or HDFS). Sorry for getting a bit geeky here.

Apache Hive: Do you know SQL? If so, you are good to go with Hive. Hive helps to read, write, and manage large datasets residing in distributed storage using SQL.

Apache Pig: Pig is a platform for creating, querying, and executing routines over large distributed data sets. Its scripting language is called Pig Latin (I am not kidding, trust me). Pig is said to be easy to understand and learn — but the real question is how many of these one can learn.

Apache Sqoop: A tool for moving data between Hadoop and non-Hadoop data stores such as data warehouses and relational databases.

Apache Storm: A free and open source real-time distributed computing system. It makes it easier to process unstructured data in real time, whereas Hadoop is used for batch processing.

Artificial Intelligence (AI): Why is AI here — isn't it a separate field, you might ask? All of these trends are so closely connected that we might as well sit back and learn about it too. AI is about developing intelligent machines and software, a combination of hardware and software that can perceive its environment, take the necessary actions when needed, and keep learning from those actions. Sounds a lot like machine learning, doesn't it? Feel free to be confused along with me.

Behavioral Analytics: Ever wondered how Google manages to serve ads for exactly the products and services you need? Behavioral analytics focuses on understanding what consumers and applications do, and how and why they act the way they do. It involves understanding our browsing patterns, social media interactions, and online shopping activity (shopping carts and so on), connecting these seemingly unrelated data points, and trying to predict outcomes. As an example, after I found a hotel and abandoned my shopping cart, I got a call from a resort vacation line. Need I say more?

Brontobytes: A 1 followed by 27 zeros — the size of the storage units of the digital world of the future. While we are here, the scale runs Terabyte, Petabyte, Exabyte, Zettabyte, Yottabyte, and Brontobyte; it is worth reading up on these units to understand them in depth.

Business Intelligence: I will reuse Gartner’s definition of BI because it explains it very well. Business Intelligence is an umbrella term that includes applications, infrastructure, tools, and best practices that access and analyze information to improve and optimize decision making and performance.

Biometrics: This is a James Bondish technology combined with analytical technology to identify people through one or more physical characteristics of the human body, such as facial recognition, iris recognition, fingerprint recognition, etc.

Clickstream analytics: This is used to analyze online click data as users browse the web. Ever wonder why certain Google ads linger even when you switch websites? Because Google knows what you’re clicking on.

Cluster Analysis: Cluster analysis is an exploratory analysis that tries to identify structure within data. It is also called segmentation analysis or classification analysis. More specifically, it tries to identify homogeneous groups of cases, i.e. observations, participants, or respondents. Cluster analysis is used to find groups of cases when the groupings are not known in advance. Because it is exploratory, it does not distinguish between dependent and independent variables. The different cluster analysis methods offered by SPSS can handle binary, nominal, ordinal, and scale (interval or ratio) data.
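
A minimal sketch using scikit-learn's k-means (one of many clustering methods; SPSS, mentioned above, offers others) on invented observations, with no labels declared in advance:

```python
from sklearn.cluster import KMeans

# Toy observations: [annual income in $1000s, store visits per month]
observations = [[20, 1], [22, 2], [85, 9], [90, 8], [24, 1], [88, 10]]

# Ask for two clusters; no dependent/independent variables are specified
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(observations)

print(labels)   # e.g. [0 0 1 1 0 1] -- two homogeneous groups emerge from the data
```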

Comparative Analytics: Since analysis is the heart of big data, I will go into a bit more depth here. Comparative analytics, as the name implies, uses statistical techniques such as pattern analysis, filtering, and decision-tree analysis to compare multiple processes, data sets, or other objects. I know this is getting more technical, but I can't entirely avoid the jargon. Comparative analytics can be used in healthcare to deliver more effective and accurate diagnoses by comparing large volumes of medical records, documents, images, and so on.

Connection Analytics: You must have seen the spider web like graphs connecting people to topics to identify influencers of a particular topic. Connection Analytics can help discover relevant connections and influences between people, products, systems in a network, and even data combined with multiple networks.

Data Analyst: Data analyst is a very important and popular job; data analysts are responsible for collecting, manipulating, and analyzing data, as well as preparing reports. I will write a more detailed article about data analysts in a future post.

Data Cleansing: As the name suggests, data cleansing involves detecting and correcting or removing inaccurate data or records from a database — remember "dirty data"? With the help of automated or manual tools and algorithms, data analysts can correct and further enrich the data to improve its quality. Remember: dirty data leads to wrong analysis and bad decisions.
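
A small sketch of typical cleansing steps with pandas; the column names and values are invented for illustration:

```python
import pandas as pd

raw = pd.DataFrame({
    "name":  ["Alice", "alice ", "Bob", None],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "age":   [34, 34, -1, 29],          # -1 is clearly an invalid entry
})

clean = (raw.dropna(subset=["name"])                      # remove incomplete records
            .assign(name=lambda d: d["name"].str.strip().str.title())  # normalize
            .drop_duplicates(subset=["name", "email"])    # remove duplicates
            .query("age > 0"))                            # drop impossible values

print(clean)
```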

Data as a Service (DaaS): We have Software as a Service (SaaS), Platform as a Service (PaaS), and now we have DaaS, which means: Data as a Service. By providing users with on-demand access to cloud data, DaaS providers can help us get high-quality data quickly.

Data virtualization: This is a data management method that allows an application to extract and manipulate data without knowing the technical details (such as where the data is stored and in what format). For example, social networks use this method to store our photos.

Dirty Data: Since big data is so attractive, people have started to add other adjectives to data to form new terms, such as dark data, dirty data, small data, and now smart data. Dirty data is unclean data, in other words, inaccurate, duplicated and inconsistent data. Obviously, you don't want to mess with dirty data. So, fix it as soon as possible.

Fuzzy logic: How often are we 100% certain about something? Very rarely! Our brains aggregate data into partial truths, which are further abstracted into thresholds that determine our decisions. Fuzzy logic is a kind of computing that aims to mimic the human brain by working with partial truths, as opposed to the absolute "0" and "1" of Boolean algebra.
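
A small illustration of "partial truth": instead of a hard yes/no answer to "is it hot?", a fuzzy membership function returns a degree of truth between 0 and 1 (the temperature thresholds below are arbitrary):

```python
def hot_membership(temp_c: float) -> float:
    """Degree (0.0 to 1.0) to which a temperature counts as 'hot'."""
    if temp_c <= 20:
        return 0.0                  # definitely not hot
    if temp_c >= 35:
        return 1.0                  # definitely hot
    return (temp_c - 20) / 15       # partial truth in between

for t in (15, 25, 30, 40):
    print(t, "->", hot_membership(t))
```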

Gamification: In a typical game you have elements such as points for competing with others and a clear set of rules. Gamification in big data means using those concepts to incentivize data collection and analysis, or to motivate users.

Graph Databases: Graph databases use concepts such as nodes and edges to represent people and businesses and the relationships between them to mine data in social media. Have you ever been amazed at how Amazon tells you what other people are buying when you buy a product? Yes, that's a graph database.

Hadoop User Experience (Hue): Hue is an open source interface that makes working with Apache Hadoop easier. It is a web-based application with a file browser for the distributed file system, a job designer for MapReduce, the Oozie framework for scheduling workflows, a shell, UIs for Impala and Hive, and a set of Hadoop APIs.

High-performance Analytic Appliance (HANA): A hardware and software in-memory platform designed by SAP for high-volume data transactions and analytics.

HBase: A distributed, column-oriented database. It uses HDFS as its underlying storage and supports both batch computation with MapReduce and transactional, interactive access to individual records.

Load balancing: Distributing the load across multiple computers or servers to achieve optimal results and utilization of the system.

Metadata: Metadata is data that describes other data. Metadata summarizes basic information about data, making it easier to find and use specific data instances. For example, the author, creation date, modification date, and size of the data are basic document metadata. In addition to document files, metadata is also used for images, videos, spreadsheets, and web pages.
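
For example, the operating system already keeps basic metadata about every file; a short sketch of reading it in Python (the file name is hypothetical):

```python
import os
from datetime import datetime

path = "report.pdf"   # hypothetical document

info = os.stat(path)
metadata = {
    "size_bytes": info.st_size,
    "modified":   datetime.fromtimestamp(info.st_mtime).isoformat(),
    "created":    datetime.fromtimestamp(info.st_ctime).isoformat(),
}
print(metadata)       # data that describes the document, not its contents
```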

MongoDB: MongoDB is a cross-platform open source database built around a document-oriented data model rather than the traditional table-based relational model. This kind of database structure is designed mainly to make the integration of structured and unstructured data in certain types of applications faster and easier.

Mashup: Fortunately, this term is similar to the word "mashup" we use in daily life. Essentially, a mashup is a method of combining different data sets into a single application (for example: combining real estate data with location data and population data). This can really make for cool visualizations.

Multi-Dimensional Databases: A database optimized for online analytical processing (OLAP) and data warehousing. If you don't know what a data warehouse is: it is nothing more than a central repository of data consolidated from multiple data sources.

MultiValue Databases: A multivalue database is a non-relational database that can directly understand three-dimensional data, which is great for directly manipulating HTML and XML strings.

Natural Language Processing: Natural Language Processing is a software algorithm designed to enable computers to more accurately understand everyday human language, allowing humans to interact with computers more naturally and effectively.

Neural Network: According to this description (http://neuralnetworksanddeeplearning.com/), neural networks are a very beautiful programming paradigm inspired by biology, which allows computers to learn from observed data. It has been a long time since someone said a programming paradigm is beautiful. In fact, neural networks are models inspired by real-life brain biology... A term closely related to neural networks is deep learning. Deep learning is a collection of learning techniques in neural networks.

Pattern Recognition: Pattern recognition happens when an algorithm identifies recurring regularities or patterns within a large data set or across disparate data sets. It is closely related to machine learning and data mining, and is sometimes even treated as synonymous with them. This capability can help researchers uncover deep patterns or reach conclusions that might otherwise seem absurd.

Radio Frequency Identification (RFID): RFID is a type of sensor that uses non-contact wireless radio frequency electromagnetic fields to transmit data. With the development of the Internet of Things, RFID tags can be embedded in any possible "things", which can generate a lot of data that needs to be analyzed. Welcome to the world of data.

Software as a Service (SaaS): Software as a Service allows service providers to host applications on the Internet. SaaS providers deliver services in the cloud.

Semi-structured data: Semi-structured data refers to data that is not formatted in a traditional way, such as the data fields associated with traditional databases or common data models. Semi-structured data is not completely raw data or completely unstructured data. It may contain some data tables, tags, or other structural elements. Examples of semi-structured data are graphs, tables, XML documents, and emails. Semi-structured data is very popular on the World Wide Web and can often be found in object-oriented databases.
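
For instance, a JSON record (the fields below are invented) is semi-structured: it has tags and nesting, but different records can carry different fields, so it does not fit neatly into a fixed relational table:

```python
import json

record = json.loads("""
{
  "from": "alice@example.com",
  "subject": "Quarterly numbers",
  "tags": ["finance", "urgent"],
  "attachments": [{"name": "q3.xlsx", "size_kb": 120}]
}
""")

# Some structure exists (keys, nesting), but the schema is not fixed:
print(record["subject"], "-", len(record.get("attachments", [])), "attachment(s)")
```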

Sentiment Analysis: Sentiment analysis involves capturing, tracking, and analyzing the emotions, feelings, and opinions expressed by consumers in various types of interactions and documents, such as social media, customer representative phone interviews, and surveys. Text analysis and natural language processing are typical technologies in the sentiment analysis process. The goal of sentiment analysis is to identify or evaluate attitudes or emotions held toward a company, product, service, person, or event.

Spatial analysis: Spatial analysis refers to the analysis of spatial data to identify or understand patterns and regularities in data distributed in geometric space. This type of data includes geometric data and topological data.

Stream processing: Stream processing is designed to run real-time, "continuous" queries and processing over "streaming data". Because it can continuously perform numerical calculations and statistical analysis over high-volume streams at very high speed, stream processing is an obvious fit for streaming data from social networks.

Smart Data is data that is useful and actionable after being processed by some algorithms.

Terabyte: A relatively large unit of digital data; 1 TB equals 1,000 GB. It is estimated that 10 TB could hold all the printed material in the Library of Congress, while 1 TB could hold the entire Encyclopaedia Britannica.

Visualization: With the right visualizations, raw data can actually be put to use. And visualization here does not mean simple charts; it means complex charts that can encompass many variables of the data while remaining readable and understandable.

Yottabytes: Nearly 1,000 Zettabytes, or 2.5 trillion DVDs. All digital storage today is about 1 Yottabyte, and that number is doubling every 18 months.

Zettabytes: Nearly 1,000 Exabytes, or 1 billion Terabytes.

Original link: http://dataconomy.com/2017/02/25-big-data-terms/

http://dataconomy.com/2017/07/75-big-data-terms-everyone-know/
