telecom dataset for machine learning

Differences in space utilization and execution time per file type. The first telecom data analysis tool is Excel, which has a number of powerful features such as forms, PivotTables, and VBA. This algorithm was used for classification in this churn predictive model. Excel is a versatile player. The dataset used in this article is available on Kaggle (CC BY-NC-ND) and contains nineteen columns (independent variables) that indicate the characteristics of the clients of a fictional telecommunications corporation. This result was very good for the company: it increased revenue and decreased the churn rate by about 1.5%. The collected data was full of columns, since there is a column for each service, product, and offer related to calls, SMS, MMS, and internet, in addition to columns related to personal and demographic information. One-time payments for large batches enable you to access historical data for making future predictions. The Spark engine was used in most phases of the model, such as data processing, feature engineering, training, and testing, since it performs the processing in RAM. The features of month N are aggregated from the N-month sliding data window (from month 1 to month N). Figure 9a displays the distribution of this feature. An Amazon Simple Storage Service (Amazon S3) bucket includes a synthetic IP Data Record (IPDR) dataset, an AWS Glue job converts the datasets, and an Amazon SageMaker instance includes Machine Learning (ML) Jupyter Notebooks. The Neighbor Connectivity equation is defined as follows. These experiments are: (1) classification with an undersampling technique, (2) classification with an oversampling technique, (3) classification without balancing the dataset.
Mathematics for Machine Learning and Data Science is a beginner-friendly Specialization where you'll learn the fundamental mathematics toolkit of machine learning: calculus, linear algebra, statistics, and probability. Improve your targeting capabilities by utilizing our telecom datasets, which enable you to connect with diverse user segments for telecom advertising campaigns. Only about 5% of the entries represent customer churn. The data is divided into three main types: structured, semi-structured, and unstructured. We created many features, such as the percentage of incoming/outgoing calls, SMS, and MMS to competitors and landlines, binary features showing whether customers subscribed to certain services, the rate of internet usage across 2G, 3G, and 4G, the number of devices used each month, the number of days out of coverage, the percentage of friends related to a competitor, and hundreds of other features. (7043, 21) Now let's see the columns in our dataset. The Spark engine is used for both statistical and social features; the library used for SNA features is GraphFrames. In addition, there are many other advantages. [12] presented an advanced data mining methodology to predict churn for prepaid customers using a dataset of call details for 3333 customers with 21 features and a dependent churn parameter with two values: Yes/No. Telecom companies keep records that include call information such as call time, call length, source and destination number, call completion status, consumer billing, and service capacity preparation, all of which can be accessed from some commercial telecom datasets. These types are classified as follows: Customer data: it contains all data related to customers' services and contract information. 2016. arXiv:1603.02754. How much historical data is needed in the feature engineering phase? We set the d value to 0.85, as used in most of the research [21, 22]. Xie J, Rojkova V, Pal S, Coggeshall S.
A combination of boosting and bagging for KDD Cup 2009 - fast scoring on a large database. IEEE Access. The dataset is aggregated to extract features for each customer. Customer retention, loyalty, and satisfaction in the German mobile cellular telecommunications market. Based on the directed graphs, we use the PageRank [19] and Sender Rank [20] algorithms to produce two features for each graph. The pipeline used for this example consists of 8 steps: Step 1: Problem Definition. The highest AUC value reached by using only SNA features was 75.3%. The computational complexity of SNA measures is very high due to the nature of the iterative calculations done on a large-scale graph, as mentioned in Eqs. The main contribution of our work is to develop a churn prediction model that helps telecom operators predict the customers who are most likely to churn. This case probably happens because the customer needs to make sure that most of his important incoming calls and contacts have moved to the new line. & Aljoumaa, K. Customer churn prediction in telecom using machine learning in big data platform. The value of k was 10. Eur Phys J B. Datasets containing 20 common telecom customer service scenarios (intents) available in 31 languages. Jaccard measure: normalizes the number of mutual friends by the union of both friends' lists [25]. AKA took the role of performing the literature review, building the big data platform, and working on the proposed churn model. With advances in machine learning and artificial intelligence, the possibilities for predicting customer churn have increased significantly. In: Data mining and knowledge discovery handbook. 2. J Big Data 6, 28 (2019). It may often be realized without human guidance and explicit reprogramming. Since telecom data largely revolves around user data, the recent introduction of GDPR in Europe has made the accumulation of telecom data more difficult.
The dataset used in this study is small and has no missing values. In addition, the data sources were of different types, and gathering them in a Data Warehouse was a very hard process, so adding new features for Data Mining algorithms required a long time, high processing power, and more storage capacity. Many providers are also willing to create custom quotes for more challenging use cases. Amin A, Anwar S, Adnan A, Nawaz M, Howard N, Qadir J, Hawalah A, Hussain A. This could be justified because some customers are using these personal GSMs for business objectives. Enable interpretability techniques for engineered features. Over time, the role of telecom operators has changed from service and infrastructure carriers to communication service providers handling data, voice, and content transfer. Parquet was the file format that gave the best results. This social network is also used to find similar customers in the network based on the mutual friend concept. Identification of top-k influential communities in big networks. Real-time telecom data retrieval and interpretation become more complex as new data sources are used, such as device data, which includes traffic analysis such as deep packet inspection, and SMS, site, search, and email. Amin et al. The total sample was 5 million customers, comprising 300,000 churned customers and 4,700,000 active customers. The model experimented with four algorithms: Decision Tree, Random Forest, Gradient Boosted Machine Tree (GBM), and Extreme Gradient Boosting (XGBOOST). The marketing experts decided to predict churn 2 months before the actual churn action, in order to have sufficient time for proactive action with these customers. Elements of effective machine learning datasets in astronomy by Bernie Boscoe et al.
The method of preparing and selecting features and adding the mobile social network features had the biggest impact on the success of this model, since the value of AUC in SyriaTel reached 93.301%. Hortonworks Data Platform (HDP) big data framework. Set up the resources for ML code development and execution. They gather the data related to dropped calls, bandwidth issues, poor download times, and the like to optimize their services with proper capacity planning, equipment monitoring, and preventive maintenance. The weighted PageRank equation is defined as follows, while the weighted Sender Rank equation is defined as follows. telecom.columns.values Output: Telecom operators have a long history of and experience in countering fraud activities. Many other methods were tested, but this applied approach gave the best performance of the four algorithms. Receiver operating characteristic curve for each classification algorithm. Bott. You can add/remove the independent variables depending on how . AJ and KJ took on a supervisory role and oversaw the completion of the work. Learning JAX in 2023: Part 3 A Step-by-Step Guide to Training Your First Machine Learning Model with JAX. These values indicate the importance of the customers, since higher values of PR(m) and SR(m) correspond to higher importance of customers in the social network. System evaluation: We evaluated the system using a new, up-to-date dataset. Machine Learning Case Study: Telco Customer Churn Prediction. The results show that most of them were related to cafes, restaurants, shaving shops, hairdressers, libraries, game shops, medical clinics, and others. The data contained transactions for all customers during the nine months before the prediction baseline.
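The AUC values quoted above summarize a receiver operating characteristic curve in a single number. As a rough illustration only (not the evaluation code used for the reported results), AUC can be computed as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The function name `auc` and the toy labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (churner) or 0 (non-churner).
    scores: predicted churn scores, higher means more likely to churn.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: two churners, two non-churners.
example_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

This pairwise form is quadratic in the number of examples, so it is only meant to make the definition concrete; on millions of customers a rank-based implementation would be needed.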
Customer churn prediction in telecom using machine learning in big data platform.
$$\begin{aligned} PR(m)=(1-d)+d*\sum _{n\in N(m)}\frac{W_{n\rightarrow m}}{\sum _{n'\in N(n)}W_{n\rightarrow n'}} PR(n) \end{aligned}$$
$$\begin{aligned} SR(m)=(1-d)+d*\sum _{n\in N(m)}\frac{W_{m\rightarrow n}}{\sum _{n'\in N(n)}W_{n\rightarrow n'}} SR(n) \end{aligned}$$
$\frac{W_{n\rightarrow m}}{\sum _{n'\in N(n)}W_{n\rightarrow n'}}$
$$\begin{aligned} NC(m)= \frac{\sum _{k\in N(m)} \left| N(k) \right| }{\left| N(m) \right| } \end{aligned}$$
$$\begin{aligned} LC(m)= \sum _{k\in N(m)} \frac{\left| N(m)\cap N(k) \right| }{ \left| N(m) \right| * (\left| N(m) \right| -1)} \end{aligned}$$
$$\begin{aligned} JS(m,k) = \frac{\left| N(m)\cap N(k)\right| }{\left| N(m)\cup N(k)\right| } \end{aligned}$$
$$\begin{aligned} CS(m,k) = \frac{\left| N(m)\cap N(k)\right| }{\sqrt{\left| N(m) \right| \left| N(k) \right| }} \end{aligned}$$
https://doi.org/10.1186/s40537-019-0191-6, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html, https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html, https://spark.apache.org/docs/latest/sql-programming-guide.html, https://doi.org/10.1186/s40537-016-0050-7, https://doi.org/10.1140/epjb/e2004-00111-4, https://doi.org/10.1016/j.dss.2008.06.007, https://doi.org/10.1016/S0169-7552(98)00110-X, http://creativecommons.org/licenses/by/4.0/.
The target class is unbalanced, and this could have a significant negative impact on the final models. Telephone (wired and wireless) networks, satellite companies, cable providers, and internet service providers are the biggest corporations in the telecommunications field. Figure 7b shows the distribution of this feature, where the Average RAT is lower for most of the churners compared with that of non-churners.
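To make the weighted PageRank equation PR(m) above more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the Spark-based implementation described in the text: the function name `weighted_pagerank`, the edge-dictionary input format, the fixed iteration count, and the toy graph are all hypothetical, and every node in the toy graph has at least one outgoing edge.

```python
from collections import defaultdict


def weighted_pagerank(edges, d=0.85, iterations=50):
    """Iterate PR(m) = (1 - d) + d * sum_n W[n->m] / out_weight(n) * PR(n).

    edges: dict mapping (source, target) -> positive edge weight.
    d: damping factor (0.85 is the value quoted in the text).
    """
    nodes = set()
    out_weight = defaultdict(float)   # total outgoing weight per node
    incoming = defaultdict(list)      # target -> list of (source, weight)
    for (src, dst), weight in edges.items():
        nodes.update((src, dst))
        out_weight[src] += weight
        incoming[dst].append((src, weight))

    rank = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for m in nodes:
            # Each in-neighbour n passes on the share of its rank that
            # corresponds to the weight of the edge n -> m.
            total = sum(w / out_weight[n] * rank[n] for n, w in incoming[m])
            new_rank[m] = (1 - d) + d * total
        rank = new_rank
    return rank


# Hypothetical call graph: weights could be call counts or durations.
toy_edges = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "c"): 1.0, ("c", "a"): 1.0}
toy_rank = weighted_pagerank(toy_edges)
```

In this toy graph, customer "c" receives weight from both "a" and "b" and ends up with a higher rank than "b". The Sender Rank equation has the same structure with the roles of incoming and outgoing edge weights swapped.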
The Social Network Analysis features followed a different scenario: the best sliding window for building the social graph and extracting appropriate SNA features was the last four months before the baseline, as shown in Fig. What are the tools for telecom data analytics? We found that more than half of the features have more than 98% missing values. Use Python to interpret & explain models (preview) - Azure Machine Learning. The damping factor d is used here to prevent these Sinks from getting higher SR or PR values in each round of calculation. in fraud detection and consumer behavioral analysis. Telecom Data | Kaggle. Qureshii SA, Rehman AS, Qamar AM, Kamal A, Rehman A. Telecommunication subscribers churn prediction model using machine learning. 11c, 12c, we believe that Social Network Analysis features make a good contribution to the performance of the churn prediction model, since they give a different insight into the customer from the social point of view. Even if you have good data, you need to make sure that it is in a useful scale and format, and even that meaningful features are included. Leskovec J, Backstrom L, Kumar R, Tomkins A. Decis Support Syst. (1), the nodes with zero outgoing edges are the Sinks while in Eq. The results were analyzed to compare the performance across different sizes of training data. On the other hand, predicting the customers who are likely to leave the company can represent a potentially large additional revenue source if it is done at an early stage [3]. On the other hand, using the Parquet file type with Snappy compression gave the best space utilization. In order to build the churn predictive system at SyriaTel, a big data platform must be installed. Therefore, finding the factors that increase customer churn is important for taking the necessary actions to reduce this churn.
The highest values for both measures are selected for each customer (top Jaccard and Cosine similarity for a similar SyriaTel customer and top Jaccard and Cosine similarity for a similar MTN customer). Traditionally, telecom data has always played a major role in marketing and sales departments. We chose to perform 10-fold cross-validation for validation and hyperparameter optimization. Many studies confirmed that machine learning technology is highly efficient at predicting this situation. Let's not forget the CRM systems, which make use of telecom data to help you nurture your leads. The first idea was to aggregate the values of columns per month (average, count, sum, max, min) for each numerical column per customer, and the count of distinct values for categorical columns. This technique is applied through learning from previous data [6, 7]. In addition, we encountered another problem: the data was not balanced. Machine learning algorithms learn from data. Welcome to the UC Irvine Machine Learning Repository! How to Prepare Data For Machine Learning. 2016;3(1):16. https://doi.org/10.1186/s40537-016-0050-7. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial . ChatGPT, machine learning and other popular AI terms to know - CNBC. It's mostly used by insurance companies and marketers e.g. Apache Flume is a distributed system used to collect and move the unstructured (CSV and text) and semi-structured (JSON and XML) data files to HDFS. Idris [14] proposed an approach based on genetic programming with AdaBoost to model the churn problem in telecommunications. https://doi.org/10.1016/S0169-7552(98)00110-X. This process took the longest time due to the huge number of columns. Leverage the Machine Learning for Telecommunication guidance out of the box, or for building your own machine learning guidance.
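The per-month aggregation idea mentioned above (average, count, sum, max, and min for each numerical column per customer, plus the count of distinct values for categorical columns) can be sketched in plain Python as follows. This is only an illustrative sketch: the text describes doing this work on Spark, and the function name `aggregate_monthly`, the record layout, and the column names `customer_id`, `month`, `duration`, and `service` are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_monthly(records, numeric_col, categorical_col):
    """Aggregate raw records into one feature row per (customer, month).

    records: list of dicts with 'customer_id', 'month', and the two
    columns named by numeric_col and categorical_col (hypothetical layout).
    """
    groups = defaultdict(list)
    for row in records:
        groups[(row["customer_id"], row["month"])].append(row)

    features = {}
    for key, rows in groups.items():
        values = [r[numeric_col] for r in rows]
        features[key] = {
            "count": len(values),
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "max": max(values),
            "min": min(values),
            # Categorical columns are summarised by their distinct values.
            "distinct": len({r[categorical_col] for r in rows}),
        }
    return features


# Hypothetical usage records for one customer over two months.
sample_records = [
    {"customer_id": "c1", "month": 1, "duration": 60, "service": "call"},
    {"customer_id": "c1", "month": 1, "duration": 120, "service": "sms"},
    {"customer_id": "c1", "month": 2, "duration": 30, "service": "call"},
]
monthly_features = aggregate_monthly(sample_records, "duration", "service")
```

With hundreds of source columns, each column produces several aggregated features per month, which is consistent with the remark that this step took the longest time.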
Social Network Analysis features: Data transformation and preparation are performed to summarize the connections between every two customers and build a social network graph based on CDR data taken from the last 4 months. The complaints database provides all submitted complaints and statistics on inquiries related to coverage, problems with offers and packages, and any problem related to the telecom business. Decis Support Syst. [13] proposed a prediction model based on the Neural Network algorithm to solve the problem of customer churn in a large Chinese telecom company with about 5.23 million customers. Data providers and vendors listed on Datarade sell Telecom Data products and samples. Mobile Cell Tower Coverage Footprint Europe, Egypt & UAE - Telecom Data by Teragence, RootMetrics Connected Insights: Mobile Network Data for USA, UK, Switzerland, South Korea, IPinfo.io Mobile IP Database | Global Coverage | IP to Mobile Carrier Linkage, Mobile Signal Strength Map Europe - Mobile Network Coverage Data by Teragence, Mobile Technology Coverage Mix Map Europe - Mobile Network Coverage Data by Teragence, Speech recognition data: telecom customer service intent scenarios in 31 languages, ThinkCX | Carrier and ISPs Telecom Market Share Data TeleBreakdown for North American, ThinkCX | Digital Advertising Audiences for North American Telecoms (200M Devices), Top 10 Telecom Data & Analytics Providers, Telecom data - Carrier & ISP (Global) by Redmob, IPinfo.io Mobile IP Database | Global Coverage | IP to Mobile Carrier Linkage by IPinfo, Mobile Cell Tower Coverage Footprint Europe, Egypt & UAE - Telecom Data by Teragence by Teragence. The hyperparameters of the algorithms were optimized using K-fold cross-validation. In spite of that, the traditional Data Warehouse system still suffers from deficiencies in computing the essential SNA measures on large-scale networks. 2005. p. 4853. Thanks to Mr. Mhd Assaf, Mr. Nour Almulhem, Mr. William Soulaiman, Mr.
Ammar Asaad, Mr. Soulaiman Moualla, Mr. Ahmad Ali, and Miss. The article contains 5 datasets each for machine learning, computer vision, and NLP. By no means is this list exhaustive. Table 4 shows AUC results for the four algorithms on the NotOffered dataset. Figure 1 presents the ecosystem of HDP, where each group of tools is categorized under a specific specialization such as Data Management, Data Access, Security, Operations, and Governance Integration. In: IEEE international conference on systems, man, and cybernetics (SMC). Methods: This study applied data mining techniques to the NPS dataset from a Malaysian telecommunications company in September 2019 and September 2020, analysing 7776 records with 30 fields to determine which variables were significant for the churn prediction model. The model was prepared and tested in a Spark environment by working on a large dataset created by transforming big raw data provided by the SyriaTel telecom company. The explanation here relies on the effect of friends on the churn decision, since the affiliation of most of a customer's friends with the other operator may be evidence of the good reputation or the strong presence of the competing company in that region or community. Each of these use cases requires related but different ML models and system architecture, depending on their unique needs and environmental constraints. The Jaccard similarity equation between customer(m) and customer(k) is defined as follows: Another similarity measure is the Cosine measure, which is similar to Jaccard's. UCI Machine Learning Repository - University of California, Irvine. There are two telecom companies in Syria: SyriaTel and MTN. Essentially, it must be in line with the recent GDPR requirements and should be available in a format that can be used by you and your tools. To apply the third strategy, companies have to decrease the potential for customer churn, known as customer movement from one provider to another [5].
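The two similarity measures described above can be sketched on plain Python sets, where each set holds the neighbours N(m) of a customer in the social graph. The Jaccard measure divides the number of mutual friends by the size of the union of both friend lists; the Cosine measure divides it by the square root of the product of the two list sizes. The function names and the toy neighbour sets are hypothetical, and returning 0.0 for empty neighbour sets is an assumption of this sketch.

```python
import math


def jaccard_similarity(neighbors_m, neighbors_k):
    """|N(m) & N(k)| / |N(m) | N(k)|, or 0.0 when both sets are empty."""
    union = neighbors_m | neighbors_k
    if not union:
        return 0.0
    return len(neighbors_m & neighbors_k) / len(union)


def cosine_similarity(neighbors_m, neighbors_k):
    """|N(m) & N(k)| / sqrt(|N(m)| * |N(k)|), or 0.0 if either set is empty."""
    if not neighbors_m or not neighbors_k:
        return 0.0
    mutual = len(neighbors_m & neighbors_k)
    return mutual / math.sqrt(len(neighbors_m) * len(neighbors_k))


# Hypothetical friend lists for two customers with two mutual friends.
friends_m = {"u1", "u2", "u3"}
friends_k = {"u2", "u3", "u4"}
js = jaccard_similarity(friends_m, friends_k)   # 2 mutual / 4 in union
cs = cosine_similarity(friends_m, friends_k)    # 2 mutual / sqrt(3 * 3)
```

Both measures reach 1.0 only when the two customers have identical friend lists; the Cosine form penalizes a large difference in list sizes less strongly than the Jaccard form does.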
Features related to IMEI data were extracted, such as the type of device, the brand, whether the device is dual or mono SIM, and how many devices the customer changed. The majority of related work focused on applying only one data mining method to extract knowledge, while the rest focused on comparing several strategies to predict churn. Compare the top telecom data vendors and companies. This dataset presents many challenges, as follows. The data moved to HDFS was kept in its original format. A low d value makes the calculations easier but gives incorrect results. Building a machine learning model is a complex step in the process of applying machine learning methodologies to solving business problems. Finally, the R and Python programming languages are very strong and scalable. SyriaTel was interested in this field of study because acquiring a new customer costs six times more than retaining a customer who is likely to churn. Telecom data providers use these methods for data business research. We experimented with three scenarios to deal with the imbalance problem: oversampling, undersampling, and no re-balancing. In addition to these records, the data must be linked to the detailed data stored in relational databases that contain detailed information about the customer. The dataset of customers predicted as most likely to churn was divided into two datasets (Offered, NotOffered). The damping factor in the telecom social graph is used to represent the interaction-through probability. The first part (1-d) represents the chance of randomly selecting a sink node, while d is used to make sure that the sum of PageRanks or SenderRanks is equal to 1 at the end. Figure 5 shows some comparison between file types.
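The two re-balancing scenarios mentioned above can be sketched with the standard library as follows. The text does not say how the resampling was performed, so this sketch assumes simple random sampling to a 1:1 class ratio; the function names, the fixed seeds, and the toy class sizes are hypothetical.

```python
import random


def undersample(majority, minority, seed=0):
    """Randomly drop majority rows until both classes have the same size.

    Assumes the majority class is at least as large as the minority class.
    """
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    return kept + list(minority)


def oversample(majority, minority, seed=0):
    """Randomly duplicate minority rows until both classes have the same size."""
    rng = random.Random(seed)
    missing = max(0, len(majority) - len(minority))
    extra = rng.choices(minority, k=missing)    # sampling with replacement
    return list(majority) + list(minority) + extra


# Hypothetical toy data: 94 active customers and 6 churners.
active = [("active", i) for i in range(94)]
churned = [("churned", i) for i in range(6)]
balanced_down = undersample(active, churned)
balanced_up = oversample(active, churned)
```

Undersampling discards most of the active customers, while oversampling keeps all rows but repeats churners, so the third scenario (no re-balancing) serves as the baseline for comparing both.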
Machine-learning-assisted approaches are promising for device identification since they can capture dynamic device behaviors and have automation capabilities.