1. INTRODUCTION :
In today’s world, social media is one of the most popular and fastest information-bearing services. Leading platforms such as Twitter and Facebook produce a huge volume of information through their microblogs. The number of microblogs produced every minute by these services is so large that older information tends to be buried quickly. Moreover, the messages on these services are often incomplete, unstructured, or irrelevant to the topic, which makes it difficult for users to find the exact information they are searching for. For example, if someone wants to know the popularity of Indian Prime Minister Mr Narendra Modi before, during, or after an election, then microblogging services can be an easy and highly available source of that popularity information. We have therefore tried to explain an efficient way of extracting messages containing the topics and subtopics of users' interest. As the popularity of social networks rises, social network analysis has become an interesting area of study: it is the process of exploring social structures through networks and graphs. To obtain valuable information, unstructured data must be converted into structured information, and Natural Language Processing is used to enhance the accuracy of visualizing structured information on social networks. Traditional business has also changed greatly; everything from product development to marketing is now conducted online. The technology area is growing at a rapid pace, leading to the formation of new, sophisticated tools for processing text, and data mining techniques are designed to handle voluminous data sets and extract significant patterns from the data.
2. NATURAL LANGUAGE PROCESSING (NLP) :
This blog analyses the extensive use of Natural Language Processing and web mining techniques to study social networks. NLP techniques map human language to a machine-readable representation. Simply searching for a single word is not a good method of analysing social communication; the goal of social network monitoring is therefore to extract and interpret ‘user communication’. Several NLP methods are combined with statistical techniques to ensure that the extracted information is correct and precise.
A) Automatic Summarization :
Automatic Summarization is the process of reducing a text document with the help of a computer program to create a summary that retains the most significant points of the original document. Such technologies can produce a coherent summary that takes into account variables such as length, writing style, and syntax. The main goal of summarization is to find a representative subset of the data that contains a summary of the entire set.
There are two approaches to summarization: Extraction and Abstraction.
1. Extraction refers to selecting a subset of existing words, phrases, or sentences in the original text to form the summary.
2. Abstraction builds an internal semantic representation and then uses natural language generation techniques to create a summary that is closer to what a human might generate.
In the Analysis phase, the most significant information in the input is identified; producing a fluent summary from it requires the capability to reorganize, modify, and merge information expressed in different sentences of the input. In the Transformation phase, an ordered text is generated by manipulating the internal representation. Finally, in the Realization phase, the summary text is generated using the scores produced by the transformation. The process of Automatic Summarization is depicted in Figure 1.
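As a minimal sketch of the extractive approach, the following pure-Python summarizer scores each sentence by the average document-wide frequency of its words and keeps the top-ranked sentences in their original order. The scoring scheme and the example text are illustrative, not a production method:

```python
from collections import Counter
import re

def summarize(text, n_sentences=1):
    """Extractive summarization: score each sentence by the average
    document-wide frequency of its words and keep the top scorers."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    chosen = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Preserve the original sentence order in the summary.
    return ' '.join(s for s in sentences if s in chosen)

doc = ("Social media produces huge volumes of text. "
       "Summarization reduces a text document to its most significant points. "
       "A good summary retains the key information of the original text.")
print(summarize(doc, 1))
```

Abstractive summarization would instead generate new sentences from an internal representation, which is far harder to sketch briefly.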
B) Chunking :
Chunking is a technique used for entity detection. Unlike tokenization, which omits whitespace, chunking selects a subset of the tokens, and the chunks formed from the source text do not overlap. It is often easier to describe what is to be excluded from a chunk than what belongs in it. A chink is defined as a sequence of tokens that is not in a chunk, and removing such a sequence from a chunk is called chinking. If the matching sequence of tokens spans an entire chunk, the whole chunk is removed; if it appears in the middle of a chunk, the tokens are removed, leaving two chunks where there was only one before.
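The chunk/chink idea can be sketched in plain Python over (word, tag) pairs. The tag set and the rule that a noun-phrase chunk is a maximal run of determiner/adjective/noun tags are simplifying assumptions for illustration; the tags are hand-assigned here, whereas a real pipeline would obtain them from a POS tagger:

```python
# Tags assumed to be in a noun-phrase chunk; everything else is a chink.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}

def chunk(tagged):
    """Group maximal runs of noun-phrase tags into chunks; the token
    sequences between chunks (the chinks) are simply skipped."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append(word)
        elif current:          # a chink token ends the current chunk
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

sent = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
        ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(chunk(sent))  # [['the', 'little', 'dog'], ['the', 'cat']]
```

Note how the verb and preposition form a chink that splits the sentence into two non-overlapping chunks.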
C) Parts-of-Speech Tagging:
A Parts-of-Speech tagger is a piece of software that reads text in some language and assigns a part of speech to each word, such as noun, verb, or adjective, to name a few. Computational applications often use more fine-grained tags such as 'noun-plural'. Dictionaries record the categories of each word, which implies that a word may belong to more than one category; taggers employ ‘probabilistic information’ to resolve this ambiguity.
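A toy version of this probabilistic disambiguation might look as follows. The table of tag counts is a hypothetical stand-in for statistics gathered from a training corpus; a real tagger would also use context, not just per-word counts:

```python
# Hypothetical per-word tag counts, as if observed in a training corpus.
TAG_COUNTS = {
    "book":  {"NN": 30, "VB": 10},   # "a book" vs. "book a flight"
    "flies": {"VBZ": 12, "NNS": 5},
    "the":   {"DT": 100},
    "time":  {"NN": 40, "VB": 2},
}

def tag(words):
    """Resolve ambiguity by picking each word's most probable tag;
    unknown words default to NN (a common fallback heuristic)."""
    return [(w, max(TAG_COUNTS.get(w.lower(), {"NN": 1}).items(),
                    key=lambda kv: kv[1])[0]) for w in words]

print(tag(["Time", "flies"]))  # [('Time', 'NN'), ('flies', 'VBZ')]
```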
D) Named Entity Recognition :
It is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, etc. For instance, in "Robert bought 500 shares of Accenture Corporation in 2008", a one-token person name, a two-token company name, and a temporal expression are detected and classified. Hand-crafted grammar-based systems obtain better precision, whereas current statistical models are first trained on annotated data to gather statistics, which are then applied to real documents. NER is also available in libraries and on the Java platform to identify names and entities. For example, from the newsfeed "enjoying U.S. weather at Texas with MonaLisa", entities like weather, Texas, and MonaLisa would be extracted.
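A rule-based sketch of this idea on the example sentence above, using a tiny hand-made gazetteer and two illustrative regular expressions (real NER systems rely on trained statistical models instead):

```python
import re

# A tiny hand-made gazetteer; real systems learn entity models from data.
GAZETTEER = {"Robert": "PERSON", "Accenture Corporation": "ORGANIZATION"}

def ner(text):
    entities = []
    for name, label in GAZETTEER.items():     # dictionary lookup
        if name in text:
            entities.append((name, label))
    for m in re.finditer(r'\b(19|20)\d{2}\b', text):   # years
        entities.append((m.group(), "TIME"))
    for m in re.finditer(r'\b\d+ shares\b', text):     # share quantities
        entities.append((m.group(), "QUANTITY"))
    return entities

print(ner("Robert bought 500 shares of Accenture Corporation in 2008."))
```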
E) Named Entity Disambiguation:
The task of determining the identity of entities mentioned in text is referred to as Named Entity Disambiguation. It is distinct from named entity extraction in that it identifies not the occurrence of names but their reference, and it needs a knowledge base of entities to which names can be linked.
F) Fact/Relation Extraction:
Once named entities have been identified in a text, we can extract the relations or facts that exist between specified types of named entities. The objective of fact extraction is to detect and distinguish the semantic relations between entities in the text.
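A pattern-based sketch of fact extraction over the earlier example sentence. The pattern and the relation name `BOUGHT_SHARES_OF` are invented for illustration; statistical systems would learn such extraction patterns from data:

```python
import re

# One hand-written pattern linking a PERSON to an ORGANIZATION.
PATTERN = re.compile(r'(?P<person>[A-Z][a-z]+) bought (?P<qty>\d+) shares '
                     r'of (?P<org>[A-Z][A-Za-z ]+?)(?= in |\.|$)')

def extract_facts(text):
    """Return (subject, relation, object, quantity) tuples."""
    return [(m['person'], 'BOUGHT_SHARES_OF', m['org'], int(m['qty']))
            for m in PATTERN.finditer(text)]

print(extract_facts("Robert bought 500 shares of Accenture Corporation in 2008."))
```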
G) Word Sense Disambiguation :
It identifies the correct sense of a word in a sentence when the word has multiple meanings. It is easy for a human to understand the intended sense of a word on the basis of background knowledge of the subject, but identifying the right sense is difficult for a machine. This methodology provides a mechanism to diminish the ambiguity of words in text. For example, WordNet is a free lexical database of English that contains a large collection of words and their senses.
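The idea can be illustrated with a simplified Lesk-style algorithm, which chooses the sense whose dictionary gloss shares the most words with the sentence. The two-sense inventory below is hand-made for illustration; in practice WordNet would supply the glosses:

```python
# A tiny hand-made sense inventory; WordNet provides real glosses.
SENSES = {
    "bank": {
        "financial": "an institution that accepts deposits of money",
        "river": "sloping land beside a body of water",
    }
}

def lesk(word, sentence):
    """Pick the sense whose gloss overlaps most with the context words."""
    context = set(sentence.lower().split())
    def overlap(sense):
        return len(context & set(SENSES[word][sense].split()))
    return max(SENSES[word], key=overlap)

print(lesk("bank", "He deposited money at the bank"))  # 'financial'
```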
H) Sentiment Analysis:
It is a process that identifies, extracts, and quantifies the attitude expressed by a user in free-form text. A text collection carries various sentiments, which can be positive, negative, or neutral. It is extensively used in processing survey forms, online reviews, and social media monitoring. It returns the identified sentiment with a numeric score between -1.0 and 1.0, where 1.0 means strongly positive and -1.0 means strongly negative. For example, "I love it" with a score of 0.8 indicates a strongly positive newsfeed or blog post.
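A minimal lexicon-based scorer in this spirit, with purely illustrative word weights, reproduces the kind of score described above (real systems also handle negation, intensifiers, and context):

```python
# Illustrative opinion-word weights in [-1.0, 1.0].
LEXICON = {"love": 0.8, "great": 0.7, "good": 0.5,
           "bad": -0.5, "terrible": -0.8, "hate": -0.8}

def sentiment(text):
    """Mean polarity of the opinion words found, clipped to [-1.0, 1.0];
    a text with no opinion words is treated as neutral (0.0)."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0
    return max(-1.0, min(1.0, sum(hits) / len(hits)))

print(sentiment("I love it"))         # 0.8 -> strongly positive
print(sentiment("terrible service"))  # -0.8 -> strongly negative
```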
OPEN SOURCE NLP LIBRARIES :
Apache OpenNLP: It is an open-source machine learning toolkit for processing natural language text. It provides services such as tokenization, summarization, searching, part-of-speech tagging, named entity extraction, translation, information grouping, natural language generation, feedback analysis, and more.
Natural Language Toolkit (NLTK): It is a leading Python library that provides modules for processing text, classifying, tokenizing, stemming, semantic reasoning, parsing, and more.
Stanford NLP: It is a suite of NLP tools that provides part-of-speech tagging, a named entity recognizer, a coreference resolution system, sentiment analysis, and more.
MALLET: It is a Java package that provides Latent Dirichlet Allocation, document classification, clustering, topic modelling, information extraction, and more.
3. CHALLENGES IN NLP :
- Informal language: Social network users post texts in an informal language that is noisy: posts often lack punctuation, contain misspellings, use non-standard abbreviations, and ignore capitalization conventions.
- Unreliable Part-of-Speech tags on such noisy text make information extraction from social networks more challenging.
- Short contexts: Social networks such as Twitter impose strict limits on post length. The shortness of the posts makes it difficult to disambiguate the entities mentioned and to resolve co-references among the feeds.
- Noisy sparse contents: Users’ posts on a social network do not always contain useful information, so filtering is required to purify the data.
- Uncertain contents: Not all information on a social network is trustworthy. Information contained in users’ contributions may conflict with other sources and is sometimes untrustworthy.
4. TEXT MINING :
Text mining is the application of data mining techniques to automatically extract information from unstructured text documents and services, with NLP used to extract meaningful information from the text. Traditional data mining searches large-scale databases, commonly known as warehouses; text mining, in contrast, is a smarter way of analysing social network information. It is a way of retrieving and searching over a social search engine that mainly indexes user-generated content such as news, videos, and images. Text mining consists of the following four steps:
- Data collection
- Pre-processing
- Generalization
- Analysis
- Data collection: This is the process of gathering and measuring information in a systematic manner, which then enables one to answer relevant questions and evaluate outcomes. It must deal with the challenge of constantly updated information: huge numbers of users access historical data at any given time, and it becomes difficult and expensive for a social network to gather a large amount of data. Summarization therefore retains all the important data and discards the insignificant data.
- Pre-processing: This step refers to the processing of raw data to provide a platform for data analysis. Its significant purpose is to turn raw sentences into sentences that can be read by the machine. The text is cleaned and delimiters are removed with the help of a pre-known list of stop words that are not useful for classifying the meaning of a sentence.
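The cleaning described in the pre-processing step can be sketched as follows; the small stop-word list is illustrative, and real pipelines use much larger lists:

```python
import re

# A small illustrative stop-word list; NLTK and others ship larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it"}

def preprocess(raw):
    """Lowercase, strip punctuation and delimiters, drop stop words."""
    tokens = re.findall(r'[a-z]+', raw.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The weather in Texas is GREAT!!"))
```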
- Generalization: This step discovers the patterns present in the pre-processed texts. It deals with developing algorithms to ascertain interesting, unforeseen, and unusual information from the patterns in the text document. One common algorithm used here is Apriori, which recognizes the frequent behaviours of persons or entities in the dataset and identifies the inherent regularities in the data.
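A brute-force sketch of the Apriori idea on word sets drawn from posts (the real Apriori algorithm additionally prunes candidate itemsets level by level using the frequent sets of the previous level):

```python
from itertools import combinations

def apriori(transactions, min_support=2, max_size=2):
    """Find itemsets appearing in at least min_support transactions."""
    frequent = {}
    items = {i for t in transactions for i in t}
    for size in range(1, max_size + 1):
        for cand in combinations(sorted(items), size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                frequent[cand] = support
    return frequent

posts = [{"modi", "election", "india"},   # invented example posts
         {"modi", "election", "rally"},
         {"weather", "texas"}]
print(apriori(posts))
```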
- Analysis: This step deals with the validation and interpretation of the generalized data patterns. Density, centrality, indegree, outdegree, and the sociogram are the major terminologies used to analyse a social network. Degree identifies the “connections” between users; centrality characterizes the behaviour of individual users in the network; indegree and outdegree are measures of centrality.
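Indegree and outdegree can be computed directly from an edge list; the tiny directed 'follows' graph below is invented for illustration, with edges listed as (follower, followed) pairs:

```python
# Invented example graph: (follower, followed) edges.
edges = [("alice", "modi"), ("bob", "modi"), ("carol", "modi"),
         ("modi", "alice"), ("bob", "carol")]

indegree, outdegree = {}, {}
for src, dst in edges:
    outdegree[src] = outdegree.get(src, 0) + 1  # outgoing links
    indegree[dst] = indegree.get(dst, 0) + 1    # incoming links

# The node with the highest indegree is the most 'central' here.
most_central = max(indegree, key=indegree.get)
print(most_central, indegree[most_central])  # modi 3
```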
Applications of Text Mining in Social Network :
1) Keyword Search: A set of keywords are used to identify the social network nodes which are close to the query result. Content and Linkage behaviour plays an important role in order to determine the query output. Query Semantics, Ranking Strategy, and Query Efficiency are the major concerns to perform keyword searches.
2) Classification: The nodes in the social network are associated with labels that are used for classifying the network. There are numerous algorithms available for the classification of text from the content.
3) Clustering: In clustering, sets of nodes with similar content are grouped to form clusters. Various clustering algorithms have been proposed that use variations of multi-dimensional data clustering techniques; K-means is a widely adopted one.
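A compact K-means sketch on 2-D points, assuming posts have already been mapped to two numeric features; real text clustering would first vectorize the posts, and production code would use a smarter initialisation than taking the first k points:

```python
import math

def kmeans(points, k, iters=10):
    centers = list(points[:k])  # deterministic initialisation for the sketch
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # move each center to the mean of its cluster
                centers[i] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```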
4) Linkage based Cross-domain learning: The linkage information between multiple domains of social networks provides the transfer of knowledge across various kinds of links. The major concern in this learning is the amount of training data available from multiple social networks.
5. CONCLUSION
NLP techniques for social networks can enhance the user's experience in a more interactive way. Traditional text mining techniques are not widely used in social network monitoring; a combination of text mining and web mining techniques should be incorporated to build a social network monitoring system. NLP techniques will enable user-friendly search for the social network user, while text mining supplies the intelligence in the social network.