Log-Linear Model on Categorical Data of HIV Cases

Abstract. Categorical data are widely used in social, health, education, and psychology research, and contingency tables are the usual way of presenting such data. One example is data on HIV infection cases. The log-linear model is an alternative for analyzing categorical data. This study uses a log-linear model to analyze HIV cases grouped by gender, age, and province. Several log-linear models are fitted, and the best model is selected based on the likelihood ratio test statistic (G^2). According to the results of the analysis and consideration of model complexity, (JK*P, JK*U, P*U), where JK denotes gender, P province, and U age, is the best model and fits the data because its p-value of 0.517 is greater than the significance level α = 0.05. This means that the pairwise interactions between gender, age, and province are significant. Studies


INTRODUCTION
Research in the social, health, education, and psychology fields often uses qualitative variables. The data collected are categorical and are presented in the form of a contingency table. In statistics, regression analysis or ANOVA is commonly used to analyze relationships between variables, but classical regression is not suited to categorical data. The log-linear model was developed to overcome this limitation of the regression model (Carota et al., 2022; Mulugeta et al., 2022).
The concept of log-linear analysis in contingency tables is analogous to analysis of variance (ANOVA) for continuously distributed response variables (Altun, 2021a; Von Eye et al., 2012; Wang et al., 2022). Several researchers have applied the log-linear model to categorical data. Altun (2021b) conducted a multi-rater agreement analysis with a log-linear model. Carota et al. (2022) assessed semi-parametric Bayesian log-linear models with an application to the expression of risk estimates. Ali et al. (2021) investigated the interaction between age and gender among car users using a log-linear model with a Bayesian inference approach. Grover & Sharma (2018) examined, using a log-linear model, the effect of reducing the predictors that influence the survival time of HIV/AIDS patients. Some of these studies obtained the best model by conducting hypothesis tests based on goodness-of-fit values, namely the likelihood ratio and chi-square statistics.
One setting that involves qualitative variables and categorical data is Human Immunodeficiency Virus (HIV) infection. HIV is a virus that attacks and destroys the body's CD4 cells, which fight infection as part of the immune system (Du et al., 2022; Grover & Sharma, 2018). In 2020, 543,100 people in Indonesia were infected with HIV, and 30,137 people were recorded as having died due to the virus (Kementerian Kesehatan Republik Indonesia, 2021). Overall, there are fewer HIV cases among women than among men in Indonesia, and most cases fall in the 25-49 age range.

METHOD
The secondary data used are the numbers of HIV cases by gender and age group in three provinces, namely Maluku, North Maluku, and West Papua, in 2020. The data source is the 2020 publication of the Ministry of Health of the Republic of Indonesia. The analysis method is log-linear analysis, with numerical results obtained using IBM SPSS 25.0 software. A three-dimensional contingency table was built from gender (male = 1, female = 2), province (Maluku = 1, North Maluku = 2, West Papua = 3), and age (≤ 4 years = 1, 5-14 years = 2, 15-19 years = 3, 20-24 years = 4, 25-49 years = 5, and ≥ 50 years = 6). The best model was determined using the backward elimination method, after which its parameters were estimated.
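The coding scheme above can be written down as a small data structure. The following is only an illustrative sketch: the names `GENDER`, `PROVINCE`, `AGE`, and `empty_table` are hypothetical, and no real case counts are included.

```python
from itertools import product

# Level codes as described in the text (labels are for illustration only).
GENDER = {1: "male", 2: "female"}
PROVINCE = {1: "Maluku", 2: "North Maluku", 3: "West Papua"}
AGE = {1: "<=4", 2: "5-14", 3: "15-19", 4: "20-24", 5: "25-49", 6: ">=50"}

# One cell per (gender, province, age) code combination.
cells = list(product(GENDER, PROVINCE, AGE))


def empty_table():
    """Return a dict mapping every cell to a zero count (placeholder, not real data)."""
    return {cell: 0 for cell in cells}


table = empty_table()
print(len(table))            # number of cells: 2 * 3 * 6
print(cells[0], cells[-1])   # first and last code combinations
```

The table therefore has 2 × 3 × 6 = 36 cells, which is the total from which the degrees of freedom of each log-linear model are counted.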
When more than two categorical variables are involved, using the chi-square test of independence to determine the relationships between the variables in a contingency table becomes difficult or sometimes impossible. In this case the log-linear model is preferred: it allows a much larger number of hypotheses to be tested than the chi-square test, and it imposes no restrictions on the number of rows and columns, both in two-dimensional tables, where chi-square can be applied, and in three-dimensional tables, where chi-square is insufficient. For a multidimensional contingency table, a log-linear model is formed to investigate the relationships between variables, the parameters of the model are estimated, and the significance of the model is tested (Abdallah, 2022; Fujisawa & Tahata, 2022). Overall model fit is assessed by comparing the expected frequencies with the observed cell frequencies for each model (Alzahrani, 2022; Asmare & Agmas, 2022).
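To make the idea of expected frequencies concrete, the sketch below fits the model containing all two-way interactions to a three-way table by iterative proportional fitting (IPF). This is an assumption made for illustration: it is a standard way to obtain fitted counts for such a model, not necessarily the algorithm used by SPSS in this study. The function name `fit_no_three_way` and the 2 x 2 x 2 counts are hypothetical.

```python
def fit_no_three_way(obs, max_iter=200, tol=1e-9):
    """Fitted counts for the all-two-way-interactions model, obtained by IPF.

    obs[i][j][k] is the observed count; the result has the same shape.
    """
    I, J, K = len(obs), len(obs[0]), len(obs[0][0])
    fit = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(max_iter):
        previous = [[row[:] for row in plane] for plane in fit]
        # Rescale so the (i, j) margin matches the observed margin.
        for i in range(I):
            for j in range(J):
                cur = sum(fit[i][j])
                scale = sum(obs[i][j]) / cur if cur > 0 else 0.0
                for k in range(K):
                    fit[i][j][k] *= scale
        # Rescale so the (i, k) margin matches.
        for i in range(I):
            for k in range(K):
                cur = sum(fit[i][j][k] for j in range(J))
                target = sum(obs[i][j][k] for j in range(J))
                scale = target / cur if cur > 0 else 0.0
                for j in range(J):
                    fit[i][j][k] *= scale
        # Rescale so the (j, k) margin matches.
        for j in range(J):
            for k in range(K):
                cur = sum(fit[i][j][k] for i in range(I))
                target = sum(obs[i][j][k] for i in range(I))
                scale = target / cur if cur > 0 else 0.0
                for i in range(I):
                    fit[i][j][k] *= scale
        # Stop once a full cycle no longer changes the fitted counts.
        change = max(
            abs(fit[i][j][k] - previous[i][j][k])
            for i in range(I) for j in range(J) for k in range(K)
        )
        if change < tol:
            break
    return fit


# Hypothetical 2 x 2 x 2 counts, used only to show the call.
toy = [[[10, 20], [30, 40]], [[15, 25], [35, 45]]]
fitted = fit_no_three_way(toy)
print([[round(v, 2) for v in row] for row in fitted[0]])
```

The fitted counts reproduce all three observed two-way margins, and comparing them with the observed counts is the basis of the goodness-of-fit assessment.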
Pearson's chi-square statistic or the likelihood ratio statistic can be used to test model fit. The likelihood ratio is more commonly used because it is the statistic that is minimized in maximum likelihood estimation (Agresti, 2003; Jamaludin et al., 2022). The likelihood ratio and chi-square test statistics are written, respectively, as follows:
$$G^2 = 2 \sum_{i}\sum_{j}\sum_{k} n_{ijk} \ln\left(\frac{n_{ijk}}{\hat{m}_{ijk}}\right), \qquad \chi^2 = \sum_{i}\sum_{j}\sum_{k} \frac{(n_{ijk} - \hat{m}_{ijk})^2}{\hat{m}_{ijk}}$$
where $n_{ijk}$ is the observed frequency and $\hat{m}_{ijk}$ is the expected frequency under the model in cell $(i, j, k)$. If the number of observations is large, G^2 and χ^2 approximately follow the chi-square distribution, with degrees of freedom equal to the number of cells in the table minus the number of free parameters fitted in the model (Aliverti & Dunson, 2022). The criterion for testing the hypothesis is that if the calculated statistic is greater than the tabulated critical value, or the p-value is smaller than the significance level (α = 0.05), then Ho is rejected (Maryana, 2013).
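As a minimal sketch of the two fit statistics, the functions below compute the likelihood ratio statistic, G^2 = 2 Σ n ln(n / m), and Pearson's statistic, χ^2 = Σ (n − m)^2 / m, from flat lists of observed counts n and expected counts m. The function names and the small 2 x 2 example are hypothetical; the convention that cells with n = 0 contribute nothing to G^2 is an assumption commonly made in practice.

```python
import math


def likelihood_ratio_g2(observed, expected):
    """G^2 = 2 * sum(n * ln(n / m)); cells with n == 0 contribute zero."""
    return 2.0 * sum(
        n * math.log(n / m) for n, m in zip(observed, expected) if n > 0
    )


def pearson_chi2(observed, expected):
    """Pearson chi-square = sum((n - m)^2 / m) over all cells."""
    return sum((n - m) ** 2 / m for n, m in zip(observed, expected))


# Hypothetical 2 x 2 table, flattened row by row, with expected counts
# under independence (row total * column total / grand total).
observed = [10, 20, 30, 40]
expected = [12.0, 18.0, 28.0, 42.0]

print(round(likelihood_ratio_g2(observed, expected), 3))
print(round(pearson_chi2(observed, expected), 3))
```

For a well-fitting model the two statistics are usually close to each other, as they are in this toy table.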

RESULT AND DISCUSSION
Table 1 is a three-dimensional contingency table of the number of HIV cases by gender, age, and province. When Table 1 is examined in terms of gender, it can be concluded that the disease occurs more often in women. This is illustrated further in the following diagram. Figure 1 shows that the number of HIV cases is quite high for both men and women, but women dominate. When the age of individuals infected with HIV is examined, individuals aged 25-49 years appear to be more at risk of developing the disease. A plausible reason is that this is the productive age range, in which individuals are more likely to be exposed through promiscuous sex, needle sharing, transmission during childbirth, and blood transfusions. This is experienced predominantly by men.
Meanwhile, Figure 2 shows that in the same 25-49 year age range the number of HIV cases is quite high in all three provinces, Maluku, North Maluku, and West Papua. However, there is a slight difference between the three provinces, with those aged 20-24 years having the lowest number of HIV cases in North Maluku. It can be concluded that the least affected age group is the 5-14 year group. A likely reason is that people in this age group are unlikely to engage in activities that can lead to HIV infection.
When viewed by gender and province, West Papua shows the opposite pattern to Maluku and North Maluku. Women are more likely to be infected with HIV than men. In addition, across the three provinces, the largest group of people infected with HIV is women in West Papua.
Based on this HIV case data, the hypotheses proposed are:
Ho: There is no relationship between gender and age. H1: There is a relationship between gender and age.
Ho: There is no relationship between province and age. H1: There is a relationship between province and age.
Ho: There is no relationship between gender and province. H1: There is a relationship between gender and province.
The relationships between variables are analyzed using the chi-square test statistic. The results of data processing using SPSS are shown in Table 2. With χ^2 = 29.636 and p-value < 0.001, Ho is rejected because the p-value is less than the significance level α = 0.05, which means that in the HIV case counts there is a relationship between gender and age. With 10 degrees of freedom, χ^2 = 22.987 gives a p-value of 0.011, so Ho is rejected, meaning there is a relationship between province and age. Likewise, gender and province have χ^2 = 29.431 and p-value < 0.001, so Ho is rejected and there is a relationship between gender and province. These results indicate relationships among gender, province, and age in HIV cases.
Seven different log-linear models were formed from the three-dimensional contingency table. Each model is evaluated with the goodness-of-fit test. The models (JK, P, U), (JK, P, U, JK*P), and (JK, P, U, JK*U) are not suitable for describing the data because their p-values are smaller than the significance level α = 0.05 and their G^2 values are very large. For these models Ho is rejected in favor of the alternative hypothesis, which states that the model is not appropriate. For the model (JK*P, JK*U, P*U), in which every pair of variables has a two-way interaction, G^2 is the smallest, namely 9.165, with p-value = 0.517. Based on the test criteria, Ho is not rejected at the significance level α = 0.05 because α is smaller than the p-value of the model. So, the model is a suitable model to describe the data.
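The degrees of freedom behind these goodness-of-fit tests can be counted as the number of cells minus the number of free parameters. The sketch below does this for hierarchical models on the 2 x 3 x 6 table; the function name `residual_df` and the way models are written as lists of generating terms are assumptions made for illustration, and the sketch assumes every lower-order term implied by an interaction is included in the model.

```python
from itertools import combinations
from math import prod

# Number of levels per variable: JK = gender, P = province, U = age.
LEVELS = {"JK": 2, "P": 3, "U": 6}


def residual_df(generators, levels=LEVELS):
    """Cells minus free parameters for a hierarchical log-linear model.

    generators: terms such as ["JK*P", "U"]; each interaction implies
    all of its lower-order terms.
    """
    terms = set()
    for g in generators:
        factors = tuple(sorted(g.split("*")))
        for r in range(1, len(factors) + 1):
            terms.update(combinations(factors, r))
    # One constant plus (levels - 1) products for every included term.
    n_params = 1 + sum(prod(levels[f] - 1 for f in t) for t in terms)
    n_cells = prod(levels.values())
    return n_cells - n_params


print(residual_df(["JK", "P", "U"]))           # main effects only
print(residual_df(["JK*P", "JK*U"]))           # two of the two-way interactions
print(residual_df(["JK*P", "JK*U", "P*U"]))    # all two-way interactions
```

For the model with all two-way interactions this gives (2 − 1)(3 − 1)(6 − 1) = 10 degrees of freedom, and the saturated model `["JK*P*U"]` leaves none.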
Meanwhile, the models (JK, P, U, P*U) and (JK, P, U, P*U, JK*P) have very large G^2 values, namely 66.799 and 37.179. Ho is rejected for both models because the p-value of each is less than α = 0.05, so these models are declared inappropriate. This is different from the model (JK, P, U, JK*P, JK*U), with G^2 = 28.680 and p-value = 0.094, which can be considered because Ho is not rejected, meaning the model is appropriate. However, after considering the complexity of the model, this model was not chosen as the best model. Based on the goodness-of-fit hypothesis tests, the model (JK*P, JK*U, P*U) is the best and most appropriate model for describing the data. Below is a table showing the goodness-of-fit results for the best model using the multinomial distribution. As can be seen in Figures 4 and 5, the observed and expected values are distributed very similarly, meaning that there is a good match between the estimates from the log-linear model and the HIV case data. This is reinforced by the residuals, which are approximately normally distributed around the fitted line, so the model is very appropriate for analyzing this data.
Since (JK*P, JK*U, P*U) has been determined to be the best model, it can be concluded that the interactions between gender and province, gender and age, and province and age are statistically significant. The risk of being infected with HIV in this study was highest between the ages of 25 and 49 years. By gender, women are more at risk of being infected with HIV than men, with a total of 513 cases across the three provinces. By province, West Papua has the highest number of HIV cases, namely 405 people.
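The reported p-values can be checked against the chi-square distribution. The sketch below is a hedged illustration: `chi2_sf` is a hypothetical helper that evaluates the upper-tail probability with a standard series for the incomplete gamma function, and the degrees of freedom (10 for the model with all two-way interactions, 20 for the model with JK*P and JK*U only) are assumptions derived from the 2 x 3 x 6 table dimensions rather than values stated in the text.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z):
    # P(a, z) = z^a * exp(-z) / Gamma(a) * sum_n z^n / (a * (a+1) * ... * (a+n)).
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


# G^2 values from the text; degrees of freedom are assumed as explained above.
print(round(chi2_sf(9.165, 10), 3))    # model (JK*P, JK*U, P*U)
print(round(chi2_sf(28.680, 20), 3))   # model (JK, P, U, JK*P, JK*U)
```

Under these assumed degrees of freedom, the computed tail probabilities should agree with the reported p-values of 0.517 and 0.094 up to rounding.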

CONCLUSION
In this study, 1013 patients infected with HIV were grouped according to gender, age, and province. The data are categorical and were analyzed using a log-linear model. According to the results of the analysis and consideration of model complexity, (JK*P, JK*U, P*U) is the best model and fits the data because its p-value of 0.517 is greater than the significance level α = 0.05. This means that the pairwise interactions between gender, age, and province are significant. Studies and explanations about HIV show that individuals between the ages of 25 and 49 years are more at risk of being infected with the virus. By gender, women were the most infected with the virus, namely 513 people. In addition, West Papua is the province with the highest number of HIV infections compared to Maluku and North Maluku.

Figure 1. Number of HIV cases by gender and age
Figure 2. Number of HIV cases by gender and age

Table 1. Number of HIV cases by gender, age and province

Table 2. Test statistics for analysis of relationships between variables

Table 3. Log-linear model with main effects and interactions between variables