2014329620025_王炎_proposal Version 0
Author: 942555271qqcom, 2018-01-07 08:57:21

yle="text-align: center;">Big Data Stream Processing Technology and System Deployment


yle="text-align: left;">Abstract


yle="text-align: left;">    The era of Big Data has begun, the users are more eager for fresh and low-latency processing results than ever. For this reason, this paper reviews the recent stream processing models for Big Data and focuses on the parallel-distributed processing models, and presents their design goals and architectures. Moreover, this paper discusses the main challenges in designing a parallel-distributed stream processing model and future works.


yle="text-align: left;">Overall Aim


yle="text-align: left;">     In the application scenarios such as the Internet / mobile Internet and the Internet of Things, the complex business requirements such as personalized service, user experience enhancement, intelligent analysis, and decision-making in the field of business put forward higher requirements for big data processing technologies. To meet these needs, big data processing systems must return processing results in milliseconds or even microseconds. Taking the largest bank card receiving agency, CUP, for example, its daily trading volume is nearly 100 million. It needs real-time risk monitoring of its more than 5.4 million merchants. While ensuring that these merchants are engaged in compliance with the order collection business, the maximum limit To protect the legitimate rights and interests of individual users. Such high concurrency, big data, high real-time application requirements pose a serious challenge to big data processing systems. The T + 1 post-crisis control system previously used by CUP has the following problems: high risk detection hysteresis (the risk can be found the next day, the damage has been caused), the processing time is long (risk identification can not be completed after more than ten hours), and the long-cycle history can not be handled Data (which can only analyze flowing water data in recent days), and major defects such as the inability to support complex rules (simple rules that only support cumulative summation). Therefore, it is urgent to develop a new risk management system to focus on achieving low latency (identifying sudden risks within 1 min), high real-time (processing results returned within 100 ms), long cycles (up to 10 years Historical period data) as well as support for high-complexity rules such as variance, standard deviation, K-th moment, maximum continuous statistics, and more. 
This goal can be abstracted as a big data science problem: how to implement low-latency, high-real-time Ad-Hoc query analysis on a complete large dataset.


yle="text-align: left;">Methodology


yle="text-align: left;">     The existing big data processing system can be divided into two categories: batch big data system and big data processing system. The batch processing big data system represented by Hadoop needs to first assemble the data into batches and load it to the analytical data warehouse after batch preprocessing so as to perform high-performance real-time query. Although these systems can achieve efficient ad hoc queries on a complete large dataset, they can not find out the latest real-time data and have high data hysteresis. Compared with the batch big data system, the stream processing big data system represented by Spark Streaming, Storm, Flink real-time data through stream processing, one by one loaded into the high-performance memory database query. Such systems can be efficient real-time data on the new pre-analysis model of the query, the data hysteresis low. However, due to memory limitations, the system discards the original historical data and can not support Ad-Hoc query analytics on the full large dataset. Therefore, the development of fast, efficient, intelligent and self-controllable features real-time streaming big data technology and platform is a top priority.


yle="text-align: left;">       Implementing a system-level approach that integrates both batch and stream processing systems and is transparent to the application requires several technical challenges to be overcome.


yle="text-align: left;"> (1) incremental calculation of complex indicators
Although count, sum, average and other indicators can be combined to achieve query results, most of the complex indicators such as variance, standard deviation, entropy and so on can not rely on simple merging to complete the fusion of query results. Again, when it comes to complex metrics that involve hot data dimensions and long cycle time windows, multiple recalculations can result in significant computational overhead.


yle="text-align: left;">     (2) Distributed memory-based parallel computing
Extensive scheduling strategies, such as conventions to import streaming data to a batch system at a fixed time each day, can result in a significant waste of memory resources. It is imperative to study a fine-grained, cohesive storage strategy based on real-time perception of progress, The earth optimizes and enhances the memory usage efficiency of converged systems.


yle="text-align: left;">     (3) Dynamic data processing of multi-scale time window drift
Data query requests from the business system involve time frames of various scales, such as "the amount of the latest 5 card transactions", "password retries in the last 10 minutes", "average transaction amount in the past 10 years", and the like. It is imperative to study and implement a method that supports multiple time-window scales (seconds to decades), multiple window shifts (data-driven, system-clock driven), each time a query request recalculates the result. ) Dynamic data real-time processing methods to quickly respond to ad hoc query requests from business systems.


yle="text-align: left;">      (4) high availability, highly scalable memory computing
Memory-based media can greatly enhance data analysis and processing capabilities. However, due to its volatile nature, multiple copies are generally required to implement memory-based HA schemes, which makes "how to ensure the consistency of different copies" solved problem. In addition, "how to rebalance the cluster while providing uninterrupted service" is also a technical challenge to be solved when the cluster is out of memory or some nodes fail. It is imperative to study distributed multi-copy consistency protocols and self-balancing smart partitioning algorithms to further enhance the availability and scalability of stream processing clusters.


yle="text-align: left;">     Streaming big data real-time processing technology has made a series of breakthroughs in the above fields. The technology provides fast processing of dynamic data based on time window drift and supports counting, summing, average, maximum, minimum, variance, standard deviation, K-order centroid, increment / decrement, maximum continuous increment / decrement, uniqueness discrimination, acquisition, filtering and other distributed statistical calculation models, and realizes the efficient management of real-time analysis and processing model sets such as complex events and context processing.


yle="text-align: left;">Expected Results


yle="text-align: left;">     The financial risk control and fraud prevention technology system based on Flow Cubes includes technologies such as fingerprint identification, proxy detection, biometrics, correlation analysis, machine learning and other technologies, knowledge (such as counterfeit card fraud, counterfeit card fraud, credit card Cash fraud, marketing fraud and other rules and models), data (such as false mobile phone data, proxy IP data, P2P dishonest data and other identification data) three plates. In the technical part, the fingerprint technology of the device collects the hardware and software related elements of the device through active and passive mixing, and issues a global unique fingerprint code for each device in combination with algorithms such as probability theory. These fingerprint codes play a role in anti-fraud whole process Very active role; proxy detection technology through a short period of time to scan IP-related port to identify those who open the proxy IP, and access to financial services in these IP identification; biometric identification device by collecting the user's mouse click, touch, Keyboard knocking and other acts to identify the operator is a person or a machine and whether the operator's own problem; correlation analysis technology at the bottom of the graph database through the storage of different nodes and relationship information, and ultimately in the interface through the form of fraudsters association analysis and complex Network analysis; machine learning technology through the supervision of unsupervised machine learning algorithm to improve the accuracy of fraud detection and coverage, combined with the flow cube technology to predict the things in the matter.


yle="text-align: left;">    The real-time machine defense system based on "flow cube" achieves the performance of microsecond (400-800μs) technology by multi-server access water-flow decision making, long cycle data decision, complex rules crawler identification, equipment dimension crawler identification and human- Identify the delay, at the same time with the robot recognition control integration, lightweight access and so on. According to feedback from dozens of customers who have access to the machine defense service, the defense system based on the "Flow Cube" platform has a robot coverage of over 95% and an accuracy rate of 99.9%. The machine defense system intercepts traffic from these network robots, which account for 80% to 90% of the original total traffic among these client business systems, reducing the pressure on their business system servers to 10%. Due to the superior recognition and control of robots based on the "Flow-Cube" machine defense system, the nation's largest ticketing platform is now fully testing this service in the hope of further enhancing its ticketing service capabilities.


yle="text-align: left;">     In addition, the real-time streaming big data processing platform based on "streaming cubic" also has a great accomplishment in the field of intelligent transportation. Real-time analysis of license plate information collected from cameras embedded in all parts of the country, with geographic location information services and geographic information system (GIS) based on the shortest traffic distance calculation, Crack down on illegal and criminal services; and enhance the city's traffic efficiency greatly by analyzing traffic flow information of intersections in real time, controlling traffic lights at each intersection in real time, intelligently transforming tidal lanes and variable lanes.


yle="text-align: left;">     The "hot data" brings unparalleled value, data from the beginning, its application value exponentially declining over time, how to fully apply the "hot data" is a renaissance affairs, is a long-term task, but also the flow of big data processing Great for technology. The "streaming cubic" real-time streaming big data processing technology and platform have broad application prospects in the fields of finance, telecommunications, transportation, public security, customs, and network security, which need to introduce the "things" perception analysis and decision-making model.


yle="text-align: left;">Summary


yle="text-align: left;">     based on batch-style big data, you can constantly learn new knowledge and accumulate new experiences. However, in applying this knowledge and experience, streaming big data is more capable of mining the potential value of "hot data" to a great extent. This makes streaming big data technology with more effective application of promotional value.


yle="text-align: left;">     Real-time processing of streaming big data is an important starting point for information technology in the era of big data. Intelligent systems that implement "sensing", "analyzing", "judging," and "making decisions" functions require the support of real-time streaming big data processing platforms. In addition, streaming big data real-time processing provides computational framework support for deep learning driven by big data. The "streaming cubic" streaming real-time processing platform for big data can support the development of next-generation artificial intelligence unified computing framework integrating various forms of logic reasoning, probability statistics, crowdsourcing, and neural networks.

