- 1. Introduction
- 2. Automatically get token and code
- 3. Connect to the Database
- 4. Search the tweets and followers
- 5. GUI interface
- 6. File split and word segmentation
1. Introduction
1.1 Overview:
We had the honor of joining the student exchange program held by UCLA and MPI. During that time, we were required to design a Sina Weibo crawler that fetches data from the servers with the help of the Sina Weibo API and stores it in a database. The program achieves the following functions. First of all, the crawler is able to get an access token automatically and fetch the data of specified user IDs, pushing them into a MongoDB database. Secondly, we can search all the followers of a given ID. In addition, our program supports keyword search, which is useful for data analysis. Sina Weibo is the data center for all the related user data and tweets; we invoke the Weibo API from Python to retrieve the ID and tweets of each user. We put the Python script into crontab so that it runs automatically every day.
1.2 Tools:
- Programming languages: Python, Java, bash, awk
- IDEs: IDLE, NetBeans, vim, bpython
- Database: MongoDB
2. Automatically get token and code
This is the main Python program; it invokes the other subprograms, writes log files, gets the token, and saves it to a file named token. Because of the limits on our account, the access token has to be renewed every day, which makes this program essential to the crawler.
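A minimal sketch of such a driver is shown below; the helper script name get_token.py, the log file name, and the overall layout are assumptions for illustration, not the original code:

```python
# main.py -- minimal sketch of the daily token-refresh driver (names are illustrative).
import logging
import subprocess

logging.basicConfig(filename='crawler.log', level=logging.INFO)

def refresh_token():
    """Run the token-fetching subprogram and save the result to a file named 'token'."""
    # get_token.py stands for the subprogram that drives the OAuth2 login
    # (sketched below); it is assumed to print the new access token on stdout.
    result = subprocess.run(['python', 'get_token.py'],
                            capture_output=True, text=True, check=True)
    with open('token', 'w') as f:
        f.write(result.stdout.strip())
    logging.info('access token renewed')

if __name__ == '__main__':
    refresh_token()
```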
This is the configuration file, which provides the authentication parameters, the app key, and the other related parameters to the main program.
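A configuration file of this kind could look like the sample below and be read with configparser; all key names are assumptions for illustration:

```python
# config.py -- sketch of loading the authentication parameters.
# Assumed contents of weibo.cfg (all key names are illustrative):
#
#   [weibo]
#   app_key      = YOUR_APP_KEY
#   app_secret   = YOUR_APP_SECRET
#   redirect_uri = https://api.weibo.com/oauth2/default.html
#   username     = YOUR_ACCOUNT
#   password     = YOUR_PASSWORD
import configparser

_cfg = configparser.ConfigParser()
_cfg.read('weibo.cfg')

APP_KEY = _cfg['weibo']['app_key']
APP_SECRET = _cfg['weibo']['app_secret']
REDIRECT_URI = _cfg['weibo']['redirect_uri']
USERNAME = _cfg['weibo']['username']
PASSWORD = _cfg['weibo']['password']
```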
When this program runs, Firefox automatically opens the authorization page and fills in the username and password contained in the configuration file. After that, a file named token is generated.
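A sketch of this automation with Selenium follows. The OAuth2 URLs are the documented Weibo endpoints, but the form-field names and the submit-button selector are assumptions about the login page:

```python
# get_token.py -- sketch of driving Firefox through the OAuth2 login.
import time

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

from config import APP_KEY, APP_SECRET, REDIRECT_URI, USERNAME, PASSWORD

AUTH_URL = ('https://api.weibo.com/oauth2/authorize'
            f'?client_id={APP_KEY}&redirect_uri={REDIRECT_URI}&response_type=code')

driver = webdriver.Firefox()
driver.get(AUTH_URL)
# The field names 'userId' and 'passwd' and the submit selector are assumptions
# about the Weibo login form; inspect the page and adjust as needed.
driver.find_element(By.NAME, 'userId').send_keys(USERNAME)
driver.find_element(By.NAME, 'passwd').send_keys(PASSWORD)
driver.find_element(By.XPATH, '//a[@node-type="submit"]').click()
time.sleep(5)  # crude wait for the redirect to the callback URL
code = driver.current_url.split('code=')[1]  # authorization code from ?code=...
driver.quit()

# Exchange the code for an access token (documented Weibo OAuth2 endpoint).
resp = requests.post('https://api.weibo.com/oauth2/access_token', data={
    'client_id': APP_KEY, 'client_secret': APP_SECRET,
    'grant_type': 'authorization_code', 'code': code,
    'redirect_uri': REDIRECT_URI})
print(resp.json()['access_token'])  # main.py captures this and writes the token file
```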
3. Connect to the Database
We use MongoDB as our database for two reasons: first, MongoDB is very easy to use, and we do not need to care about SQL statements; second, MongoDB has relatively high performance compared with other DBMSs.
This is the Python code used to connect to the database. We use MongoClient as the connection tool. In the database, we create a collection named idtable to store users' IDs and their tweets.
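A sketch of such a connection with pymongo; the host, port, database name, and document layout are assumptions for illustration:

```python
# db.py -- sketch of the MongoDB connection.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['weibo']      # database name is an assumption
idtable = db['idtable']   # the collection storing user IDs and their tweets

# The document layout is an assumption: one document per user.
idtable.insert_one({'uid': '1234567890', 'tweets': ['first tweet', 'second tweet']})
```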
We use mViewer to track the status of MongoDB. It has a GUI and can display the statistics of the database in the form of charts.
4. Search the tweets and followers
4.1 Search the tweets of a given ID:
This function is used to search the tweets of a specific user, given his or her user ID. First, we search for the tweets in our database; if nothing is found there, we go directly to the Weibo server to do the search.
For example, a user can type an ID number. If the number hits in the database, the tweets are displayed immediately; if not, the Python program tries to search the Sina Weibo server.
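A sketch of this database-first lookup, using the same collection layout assumed above; statuses/user_timeline is the documented Weibo API v2 endpoint:

```python
# Sketch of the "database first, then Weibo server" tweet lookup.
import requests
from pymongo import MongoClient

idtable = MongoClient('localhost', 27017)['weibo']['idtable']

def search_tweets(uid, access_token):
    """Return the tweets of a user, preferring the local database."""
    doc = idtable.find_one({'uid': uid})
    if doc is not None:
        return doc['tweets']
    # Cache miss: fall back to the Weibo server.
    resp = requests.get('https://api.weibo.com/2/statuses/user_timeline.json',
                        params={'access_token': access_token, 'uid': uid})
    return [status['text'] for status in resp.json().get('statuses', [])]
```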
4.2 Search user IDs and their followers:
The following code is used to search all the user IDs that match the numbers we input.
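A sketch of this lookup as a MongoDB regular-expression query, with the same collection layout assumed above:

```python
# Sketch: find every stored user ID that contains the digits the user typed.
import re
from pymongo import MongoClient

idtable = MongoClient('localhost', 27017)['weibo']['idtable']

def search_ids(digits):
    """Return all user IDs in the collection that match the given digits."""
    return [doc['uid'] for doc in idtable.find({'uid': re.compile(re.escape(digits))})]
```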
4.3 Search the tweets that match the words:
This code implements the function of searching all the tweets that contain the given words. Because of the account limitation, we only search the tweets in our database rather than spending a lot of time searching them online.
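A sketch of this keyword search over the cached tweets (database only, as explained above):

```python
# Sketch: find every cached tweet that contains the given keyword.
import re
from pymongo import MongoClient

idtable = MongoClient('localhost', 27017)['weibo']['idtable']

def search_keyword(word):
    """Return all tweets in the database that contain the keyword."""
    pattern = re.compile(re.escape(word))
    matches = []
    # A regex query on an array field matches documents where any element matches.
    for doc in idtable.find({'tweets': pattern}):
        matches.extend(t for t in doc['tweets'] if word in t)
    return matches
```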
5. GUI interface
We built a Java Swing GUI program to display the results more clearly and in a friendlier way. It makes our functions much easier to use.
You need to set up the MongoDB server in advance in order to fetch the data. Here, we set the server up on a local virtual machine and use port forwarding.
After clicking the button to search tweets, a small window pops up; you can type the user ID, and all the tweets will be displayed in the text field.
6. File split and word segmentation
6.1 File split:
AWK is a very powerful pattern-scanning and processing language. Combined with other bash commands such as paste and sed, data splitting and formatting is quite an easy job.
Each line of the input is formatted as uid:tweet, and we leave a space after each tweet, which makes the subsequent processing more convenient.
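The original split is done with awk; a rough Python equivalent of the same step, assuming an input file named all_tweets.txt with one uid:tweet pair per line:

```python
# split.py -- Python equivalent of the awk-based split: one output file per user ID.
with open('all_tweets.txt', encoding='utf-8') as src:
    for line in src:
        uid, sep, tweet = line.rstrip('\n').partition(':')
        if not sep:
            continue  # skip malformed lines without a uid:tweet separator
        # Append the tweet, followed by the separating space, to this ID's file.
        with open(uid + '.txt', 'a', encoding='utf-8') as out:
            out.write(tweet + ' ')
```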
This is the expected effect after the split: each file contains the tweets of one ID.
6.2 Word segmentation:
Our group wrote a program based on the ANSJ Chinese word segmentation tool. The program filters out words that are not noun-like, such as verbs, punctuation, adjectives and so on; relevant words such as people's names, place names and product names are kept. Since the ANSJ dictionary does not contain words specific to Macao, such as the names of casinos, streets and shopping malls, we built a user dictionary for the segmentation program. This is the effect after the word segmentation: each sentence in each tweet is divided into separate words.
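ANSJ itself is a Java library; purely to illustrate the same idea (noun-oriented part-of-speech filtering plus a user dictionary), here is a sketch using the Python library jieba as a stand-in, with a hypothetical dictionary file macao_dict.txt:

```python
# segment.py -- illustration of POS filtering with a user dictionary.
# The real program uses the Java ANSJ tool; jieba is only a stand-in here.
import jieba
import jieba.posseg as pseg

# User dictionary with Macao-specific words (casino, street and mall names, ...).
jieba.load_userdict('macao_dict.txt')

def segment(tweet):
    """Split a tweet into words and keep only the noun-like ones
    (tags such as n, nr for person names, ns for place names)."""
    return [word for word, flag in pseg.cut(tweet) if flag.startswith('n')]
```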