Portfolio

Lin Meng

Introduction

This project explores suicide in order to understand what leads to it and what behavior can decrease the suicide rate. Suicide is a serious problem. Many people have attempted suicide, and some of them regret the impulse immediately afterward. Even those who were not badly hurt may still feel scared whenever they think about it. This project discusses the factors that can lead people to suicide and some activities aimed at preventing it. I hope this helps other people and myself. The project addresses the ten questions below.

1) Does the suicide rate differ between generations?

2) Does the suicide rate differ between males and females?

3) Can education affect the suicide rate?

4) Does the suicide rate differ for people who live in different geographic locations?

5) Are there economic factors that affect the suicide rate?

6) Is there a relationship between mental health and suicide?

7) Does alcohol affect the suicide rate?

8) Is there any relationship between the Internet and the suicide rate?

9) Is the suicide rate lower in countries that have a higher life expectancy?

10) Do the Internet and related activities affect suicide or suicide prevention?

This project uses five models to analyze several parts of the suicide data against different factors, each addressing some of the ten questions above. Each model is introduced and explained in more detail below.

The first model is clustering, applied to one numeric dataset and one text dataset. A small sample of each dataset is shown in the pictures below. For the numeric data, clustering groups countries into several types by geolocation. The clusters are then compared with the suicide rate to check whether suicide rate and geolocation are related, which addresses question 4. For the text data, clustering groups online news articles into several types. Each type is then compared with the month in which the news was published, to find out whether news in September differs significantly from news in October. This helps show the effect of suicide prevention month and addresses questions 8 and 10.
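
As a rough illustration of the numeric part, the sketch below runs a plain k-means loop on (latitude, longitude) pairs. It is only a minimal sketch: the function name `kmeans`, the starting centroids, and the coordinates are all made up for this example, and treating latitude and longitude as flat Euclidean coordinates is a simplifying assumption. The project itself may use a different clustering method or library.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                # The mean of each coordinate becomes the new centroid.
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    labels = [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return labels, centroids

# Hypothetical (latitude, longitude) pairs, invented for illustration only.
coords = [(60.0, 10.0), (62.0, 15.0), (59.0, 18.0),
          (-30.0, 25.0), (-33.0, 20.0), (-28.0, 30.0)]
labels, centers = kmeans(coords, centroids=[coords[0], coords[3]])
print(labels)
```

Once every country has a cluster label, the average suicide rate of each cluster can be compared to see whether location groups differ.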

The second model is ARM (association rule mining) and networking. A small sample of the data is shown in the pictures below. The data is text from Twitter posts. ARM and networking split each post into words and then examine the relationships between the words within each post. This part covers the Internet and suicide prevention activities, which addresses questions 8 and 10.
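
To make the idea of word-level rules concrete, here is a minimal sketch that treats each post as a set of words and computes support and confidence for word pairs. It is a simplified stand-in for a full Apriori implementation (pairs only, no stop-word removal). The function name `pair_rules`, the thresholds, and the example posts are assumptions invented for this illustration, not the project's data.

```python
from collections import Counter
from itertools import combinations

def pair_rules(posts, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules for word pairs that co-occur in posts."""
    n = len(posts)
    # Each post becomes a "transaction": the set of distinct words it contains.
    transactions = [set(p.lower().split()) for p in posts]
    item_counts = Counter(w for t in transactions for w in t)
    pair_counts = Counter(pair for t in transactions
                          for pair in combinations(sorted(t), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of posts that contain both words
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical posts, invented for illustration only.
posts = [
    "suicide prevention month",
    "suicide prevention hotline",
    "prevention month awareness",
    "mental health awareness",
]
rules = pair_rules(posts)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Each rule can also be read as a directed edge from `lhs` to `rhs`, which is one way to build the word network from the same counts.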

The third model is the decision tree. It uses one record dataset, one text dataset of Twitter posts, and one text dataset of online news. A small sample of each dataset is shown in the pictures. For the record data, a tree is built from generation, gender, and gdp_per_capita to find the factors that influence the suicide rate, which addresses questions 1, 2, and 5. For the text data, the trees cover the Internet and suicide prevention activities by finding the keywords that differ between September and October, which addresses questions 8 and 10.
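
The core step of growing a tree on the record data is choosing which column to split on. The sketch below picks the column with the lowest weighted Gini impurity. It assumes Gini as the split criterion, which may not be what the project's tree uses, and the rows are placeholder values (`gen_A`, `gen_B`, and the `high`/`low` labels) invented for illustration, so they say nothing about real suicide rates.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(rows, feature, target):
    """Weighted Gini impurity after grouping the rows by one categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_feature(rows, features, target):
    """Pick the feature whose split leaves the purest groups."""
    return min(features, key=lambda f: split_gini(rows, f, target))

# Placeholder records, invented for illustration only.
rows = [
    {"gender": "male", "generation": "gen_A", "rate": "high"},
    {"gender": "female", "generation": "gen_A", "rate": "high"},
    {"gender": "male", "generation": "gen_B", "rate": "low"},
    {"gender": "female", "generation": "gen_B", "rate": "low"},
]
root = best_feature(rows, ["gender", "generation"], "rate")
print(root)
```

A full tree repeats this choice inside each resulting group until the groups are pure or too small to split. A numeric column such as gdp_per_capita would first need a threshold to turn it into two groups.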

The fourth model is naive Bayes. It uses two record datasets and one text dataset, and a small sample of each is shown in the pictures. Record data 1, analyzed in R, covers longitude, latitude, gdp_per_capita, and the status of countries to find the factors that influence the suicide rate, which addresses questions 4 and 5. Record data 2, analyzed in Python, covers life expectancy, alcohol, diphtheria, HIV/AIDS, and schooling to find the factors that influence the suicide rate, which addresses questions 3, 6, and 9. The text data covers the Internet and suicide prevention activities by finding the keywords that differ between September and October, which addresses questions 8 and 10.
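
For the text part, a naive Bayes classifier can be sketched as word counts per month plus Laplace smoothing. The version below is a minimal multinomial naive Bayes written from scratch. The function names `train_nb` and `predict_nb` and the four headlines are hypothetical, invented only to show the mechanics, and the project's own code may rely on a library instead.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest log posterior under Laplace smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for w in doc.lower().split():
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical headlines and month labels, invented for illustration only.
docs = ["prevention awareness campaign", "awareness walk event",
        "school sports news", "local sports results"]
months = ["September", "September", "October", "October"]
model = train_nb(docs, months)
predicted = predict_nb(model, "awareness campaign")
print(predicted)
```

Comparing the smoothed word probabilities of the two months is one way to see which keywords separate September from October.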

The fifth model is SVM (support vector machine). It uses two record datasets and one text dataset, and a small sample of each is shown in the pictures. Record data 1, analyzed in R, covers longitude, latitude, gdp_per_capita, and the status of countries to find the factors that influence the suicide rate, which addresses questions 4 and 5. Record data 2, analyzed in Python, covers life expectancy, alcohol, diphtheria, HIV/AIDS, and schooling to find the factors that influence the suicide rate, which addresses questions 3, 6, and 9. The text data covers the Internet and suicide prevention activities by finding the keywords that differ between September and October, which addresses questions 8 and 10.