2014329620025_王炎_proposal, Version 0
Author: 942555271qqcom, 2018-01-07 08:57:21
Big Data Stream Processing Technology and System Deployment
Abstract
The era of Big Data has begun, and users are more eager than ever for fresh, low-latency processing results. This paper therefore reviews recent stream processing models for Big Data, focusing on parallel-distributed processing models and presenting their design goals and architectures. It also discusses the main challenges in designing a parallel-distributed stream processing model and directions for future work.
Overall Aim
In application scenarios such as the Internet, the mobile Internet, and the Internet of Things, complex business requirements such as personalized service, improved user experience, intelligent analysis, and business decision-making place higher demands on big data processing technologies. To meet these needs, big data processing systems must return results within milliseconds or even microseconds. Take CUP, the largest bank card acquirer, as an example: its daily trading volume is nearly 100 million, and it needs real-time risk monitoring of its more than 5.4 million merchants. While ensuring that these merchants conduct their acquiring business in compliance with the rules, it must also protect the legitimate rights and interests of individual users to the greatest possible extent. Such high-concurrency, large-volume, highly real-time requirements pose a serious challenge to big data processing systems. The T+1 after-the-fact risk control system previously used by CUP has the following major defects: high risk detection lag (a risk is found only the next day, after the damage has been done), long processing time (risk identification is still not complete after more than ten hours), inability to handle long-cycle historical data (only the transaction records of recent days can be analyzed), and inability to support complex rules (only simple rules such as cumulative summation are supported). It is therefore urgent to develop a new risk management system that achieves low latency (identifying sudden risks within 1 min), high real-time performance (returning processing results within 100 ms), long cycles (historical data covering periods of up to 10 years), and support for high-complexity rules such as variance, standard deviation, k-th moments, and maximum continuous statistics.
This goal can be abstracted as a big data science problem: how to implement low-latency, highly real-time ad hoc query analysis on a complete large dataset.
Methodology
Existing big data processing systems can be divided into two categories: batch processing systems and stream processing systems. Batch processing systems, represented by Hadoop, must first assemble the data into batches and, after batch preprocessing, load it into an analytical data warehouse so that high-performance real-time queries can be run. Although these systems can serve efficient ad hoc queries over a complete large dataset, they cannot see the latest real-time data and therefore have high data lag. Compared with batch systems, stream processing systems represented by Spark Streaming, Storm, and Flink process real-time data as a stream, loading it record by record into the high-performance memory databa
Implementing a system-level approach that integrates both batch and stream processing systems, and that is transparent to the application, requires overcoming several technical challenges.
(1) Incremental calculation of complex indicators
Although indicators such as count, sum, and average can be combined to produce query results, most complex indicators, such as variance, standard deviation, and entropy, cannot be fused by simple merging. Moreover, for complex metrics that involve hot data dimensions and long time windows, repeated recalculation incurs significant computational overhead.
(2) Distributed memory-ba
Coarse-grained scheduling strategies, such as a convention of importing streaming data into the batch system at a fixed time each day, can waste a significant amount of memory. It is imperative to study a fine-grained, cohesive storage strategy ba
(3) Dynamic data processing of multi-scale time window drift
Data query requests from the business system involve time fr
(4) High-availability, highly scalable memory computing
Memory-ba
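Challenge (1) above notes that variance and standard deviation cannot be fused by simply merging partial results. One well-known way to make them mergeable, shown here as a minimal sketch and not as the method this proposal adopts, is to keep a small summary (count, mean, sum of squared deviations) for each partition and combine summaries with the pairwise update formula of Chan, Golub, and LeVeque. The class name `Moments`, the helper `summarize`, and the sample amounts are all hypothetical.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Moments:
    """Mergeable summary: count, mean, and M2 (sum of squared deviations)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> "Moments":
        """Welford update: fold one new value into the summary."""
        n = self.n + 1
        delta = x - self.mean
        mean = self.mean + delta / n
        m2 = self.m2 + delta * (x - mean)
        return Moments(n, mean, m2)

    def merge(self, other: "Moments") -> "Moments":
        """Combine two partial summaries without revisiting the raw data."""
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def variance(self) -> float:
        """Population variance; 0.0 for an empty summary."""
        return self.m2 / self.n if self.n else 0.0

    @property
    def stddev(self) -> float:
        """Population standard deviation."""
        return math.sqrt(self.variance)


def summarize(values) -> Moments:
    """Build a summary by folding values in one at a time."""
    acc = Moments()
    for v in values:
        acc = acc.add(v)
    return acc


# Hypothetical data: historical amounts already held by the batch side,
# and fresh amounts that only the stream side has seen so far.
batch_part = summarize([120.0, 80.0, 100.0, 95.0])
stream_part = summarize([300.0, 20.0])

# The fused result covers the complete dataset without recomputation.
combined = batch_part.merge(stream_part)
```

Because `merge` only needs the two summaries, a query layer could in principle fuse a precomputed batch result with a live stream result at query time. Entropy and the other indicators named in the text would need their own mergeable summaries, which this sketch does not cover.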
Streaming big data real-time processing technology has made a series of breakthroughs in the above fields. The technology provides fast processing of dynamic data ba
Expected Results
The financial risk control and fraud prevention technology system ba
The real-time machine defense system ba
In addition, the real-time streaming big data processing platform ba
"Hot data" carries unparalleled value: from the moment data is generated, its application value declines exponentially over time. Making full use of "hot data" is an emerging undertaking and a long-term task, and it is also a major opportunity for streaming big data processing technology. The "streaming cubic" real-time streaming big data processing technology and platform have broad application prospects in finance, telecommunications, transportation, public security, customs, and network security, fields that need to introduce the "things" perception, analysis, and decision-making model.
Summary
Real-time processing of streaming big data is an important starting point for information technology in the era of big data. Intelligent systems that implement "sensing," "analyzing," "judging," and "making decisions" functions require the support of real-time streaming big data processing platforms. In addition, streaming big data real-time processing provides computational fr