R

Methods for choosing the optimal number of states in a hidden Markov model

  • October 19, 2015

Package: RHmm (R)

I have a vector to which I am fitting an HMM, and I am trying to choose the optimal number of states for the hidden Markov model.

x <- c(-0.0961421466, -0.0375458485, 0.0681121271, 0.0259201028, 0.0016780785, 0.0311860542,
       0.0067940299, 0.0126520055, 0.0357599812, 0.0007679569, 0.0409759326, 0.0560839083,
       -0.0272581160, -0.0439501404, 0.0321578353, 0.0196158110, -0.0097262133, -0.0226182376,
       0.0119897380, -0.0099522863, -0.0359443106, -0.0039363349, -0.0476283592, -0.0383203835,
       -0.0518624079, 0.0187455678, 0.0950535435, 0.0057115192, -0.0307805051, -0.0272725295,
       -0.0254645538, -0.0102565781, -0.0267986024, -0.0482906267, -0.0256826510, -0.0414746754,
       -0.0470666997, 0.0284912760, 0.1021992517, 0.0875572274, 0.0064152031, 0.0200731787,
       -0.0091688456, -0.0575608699, -0.0442028942, -0.0277449185, -0.0115369429, 0.0084710328,
       0.0745290085, 0.0159369842, -0.0784550401, -0.0934970644, -0.0978390888, 0.0160188869,
       0.0275268626, -0.0552651617, 0.0033928140, 0.0468507896, 0.0374087653, 0.0521167410,
       -0.0177752833, -0.0592673076, 0.0514406681, 0.0847486437, 0.0738066194, -0.0098354049,
       -0.0572274292, 0.0478305465, 0.0096885221, -0.0445535022, -0.0153455265, -0.0105375508,
       0.0100704249, -0.0035215994, 0.0243363762, 0.0504443519, 0.0570023276, 0.0395103033,
       -0.0612817210, -0.0557737453, -0.0273657697, -0.0220077940, 0.0083501817, 0.0275081574,
       0.0323161331, 0.0385741087, 0.0175820844, -0.0410599399, -0.0071019642, 0.0431060115,
       -0.0107360128, -0.0007280372, 0.0360799385, -0.0061620858, 0.0164458899, -0.0050461344,
       -0.0578381588, 0.0097198169, 0.0027277926, -0.0127642317,
       -0.0037062560, -0.0045482803, 0.0367596953, 0.0021176710, -0.0319243533, -0.0194663776,
       0.0091915981, 0.0061495737, -0.0090424506, 0.0127655251, 0.0161735008, 0.0193814765,
       -0.0208605478, -0.0598025722, 0.0022554035, 0.0473633792, 0.0247213549, -0.0063206694,
       -0.0201626938, 0.0207952819, 0.0379032576, 0.0151612333, 0.0038692090, 0.0111271847,
       0.0497851603, 0.0273431360, -0.0172488883, -0.0038909126, 0.0264670631, -0.0065249612,
       -0.0467169856, -0.0255090099, 0.0082489658, 0.0352569415, 0.0272149172, 0.0074228928,
       -0.0040191315, -0.0170611558, -0.0309531801, -0.0327952044, -0.0239372287, -0.0212792531,
       -0.0132712774, 0.0086866983, -0.0007553260, 0.0107026497, 0.0065106253, -0.0321813990,
       -0.0081734233, 0.0296845524, 0.0268925281, -0.0025994962, -0.0038915206, -0.0126335449,
       0.0040244308, 0.0227324065, 0.0114903822, -0.0031516422, 0.0031563335, 0.0137143092,
       0.0026222849, 0.0035802606, 0.0111382363, -0.0008037881, -0.0282458124, 0.0056121633,
       0.0254201390, 0.0033781147, -0.0166139097, -0.0124559340, 0.0088520417, 0.0072600174,
       -0.0050320069, -0.0114740312, -0.0066160556, -0.0042080799, -0.0205501042, 0.0027078715,
       0.0122158472, -0.0206261771, -0.0267682015, -0.0107602258, 0.0088477499, 0.0165057256,
       0.0106637013, 0.0115216769, 0.0278296526, 0.0026376283, -0.0231543960, -0.0141964203)

library(RHmm)

# Random train/test partition (about 66% of the points go to training)
nhs <- c(2, 3, 4)  # candidate numbers of hidden states
train <- runif(length(x)) <= 0.66

# Held-out log-likelihood of the test observations for each candidate number of states
error <- numeric(length(nhs))
for (i in seq_along(nhs)) {
  fit <- HMMFit(x[train], dis = "NORMAL", nStates = nhs[i], asymptCov = FALSE)
  pred <- forwardBackward(fit, x[!train])
  error[i] <- pred$LLH
}
nhs[which.max(error)]  # optimal number of hidden states (maximum log-likelihood)
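
For readers who want to see what a held-out log-likelihood of this kind measures, here is a minimal base-R sketch of the scaled forward recursion for a univariate Gaussian HMM. The function name `hmm_loglik` and its arguments are hypothetical and not part of RHmm; it assumes the initial distribution, transition matrix, means, and standard deviations are already known (for example, taken from a fitted model).

```r
# Log-likelihood of observations x under a Gaussian HMM (scaled forward algorithm).
# init:  vector of initial state probabilities (length m)
# trans: m x m transition matrix, rows sum to 1
# means, sds: per-state emission parameters (length m each)
hmm_loglik <- function(x, init, trans, means, sds) {
  # Joint weight of each state with the first observation
  alpha <- init * dnorm(x[1], means, sds)
  s <- sum(alpha)
  ll <- log(s)
  alpha <- alpha / s  # rescale to avoid numerical underflow
  for (t in seq_along(x)[-1]) {
    # Propagate one step through the chain, then weight by the emission density
    alpha <- as.vector(alpha %*% trans) * dnorm(x[t], means, sds)
    s <- sum(alpha)
    ll <- ll + log(s)
    alpha <- alpha / s
  }
  ll
}
```

With a single state the recursion reduces to the ordinary i.i.d. Gaussian log-likelihood, which is a convenient sanity check.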

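Every time I run the model to find the optimal number of states, I get a different number. I believe this is because the model is trained on a newly drawn random sample each time and because the fit can end up in a local optimum. This does not happen if I simply fit the model to the full series: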

# Score proportional to the probability that a sequence is generated by a given model
nhs <- c(2, 3, 4)
error <- numeric(length(nhs))
for (i in seq_along(nhs)) {
  fit <- HMMFit(x, dis = "NORMAL", nStates = nhs[i], asymptCov = FALSE)
  VitPath <- viterbi(fit, x)  # most likely state path (not used for the selection)
  error[i] <- fit[[3]]        # third element of the fit object, used as the criterion
}
error[is.na(error)] <- 10000  # penalize fits that returned NA
nhs[which.min(error)]  # optimal number of hidden states (minimum AIC)
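
As a rough illustration of what an information criterion like this trades off, the sketch below computes AIC and BIC from a log-likelihood. It assumes a univariate Gaussian HMM with m states and counts m(m-1) free transition probabilities, m-1 free initial probabilities, and m means plus m standard deviations. The function name `hmm_ic` is hypothetical, and the counting convention is an assumption: packages may count parameters differently.

```r
# AIC and BIC for a univariate Gaussian HMM, given its maximized log-likelihood.
# loglik:   log-likelihood of the fitted model
# n_states: number of hidden states m
# n_obs:    number of observations used in the fit
hmm_ic <- function(loglik, n_states, n_obs) {
  m <- n_states
  # Assumed parameter count: transitions + initial distribution + means and sds
  k <- m * (m - 1) + (m - 1) + 2 * m
  c(n_par = k,
    AIC = -2 * loglik + 2 * k,
    BIC = -2 * loglik + k * log(n_obs))
}
```

Because the parameter count grows roughly with the square of the number of states, a larger model has to improve the log-likelihood substantially before either criterion prefers it.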

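However, the two approaches give very different results. Which one is better? On the one hand, the first gives me a model that I can test on new samples. On the other hand, the second gives the best fit to the samples already seen. With the first approach, if I repeat the test, the resulting number of states changes because the train/test split changes at random. In that case, which method should I use to decide that the model generalizes, that is, which number of states is best?

What other methods could I use to choose the optimal number of states?

Many thanks.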
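
One way to make the split-based choice depend less on a single random partition is to repeat the split many times and tally which number of states wins each time. The sketch below is only an illustration: `vote_states`, `fit_fun`, and `score_fun` are hypothetical names, and the fitting and scoring steps are passed in as functions so that any HMM package can be plugged in. It keeps the random-mask split from the question, which ignores the time ordering of the series.

```r
# Repeat a random train/test split and count how often each candidate wins.
# fit_fun(train, k) must return a fitted model with k states;
# score_fun(fit, test) must return a score where higher is better.
vote_states <- function(x, nhs, fit_fun, score_fun, n_rep = 20, train_frac = 0.66) {
  n <- length(x)
  winners <- numeric(n_rep)
  for (r in seq_len(n_rep)) {
    is_train <- runif(n) <= train_frac  # fresh random partition for this repetition
    scores <- vapply(nhs, function(k) {
      fit <- fit_fun(x[is_train], k)
      score_fun(fit, x[!is_train])
    }, numeric(1))
    winners[r] <- nhs[which.max(scores)]
  }
  # Vote counts per candidate, including candidates that never won
  table(factor(winners, levels = nhs))
}
```

With RHmm, `fit_fun` could wrap `HMMFit(train, dis = "NORMAL", nStates = k, asymptCov = FALSE)` and `score_fun` could return `forwardBackward(fit, test)$LLH`, as in the code in the question. If the votes are spread almost evenly across candidates, that is itself a sign that the data do not strongly favor any one of them.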

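Determining the optimal number of states in an HMM is indeed a tricky problem.

Have a look at the following paper:

The number of regimes across asset returns: identification and economic value, by M. Gatumel and F. Ielpo (2011)

From the abstract: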

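A shared belief in the financial industry is that markets are driven by two types of regimes. Bull markets would be characterized by high returns and low volatility, whereas bear markets would display low returns coupled with high volatility. Modelling the dynamics of different asset classes (stocks, bonds, commodities and currencies) with a Markov-switching model and using a density-based test, we reject the hypothesis that two regimes are enough to capture the evolution of asset returns for many of the investigated assets. Once the accuracy of our test methodology has been assessed through Monte Carlo experiments, our empirical results point out that between two and five regimes are required to capture the features of each asset's distribution. Moreover, we show that only a part of the underlying number of regimes is explained by distributional characteristics of the returns such as kurtosis. A thorough out-of-sample analysis provides additional evidence that there are more than just bulls and bears in financial markets. Finally, we highlight that taking into account the real number of regimes allows both improved portfolio returns and density forecasts.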

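On the other hand, you have to think about what you will do with these states once you have identified them. Usually you want to build a trading strategy on top of them. So if you have five states... what now? My motto is to keep it simple, so in most cases I still use only two states, because they translate very intuitively into long and flat.

If you choose three states, you will often find that one of them is a very short-lived crash state that captures all of the extreme (left) tail. You cannot really use it, because by the time you identify the crash state in real time it is already too late. You can usually avoid those tails better with a two-state model.

But I would love to hear about other people's experiences too!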

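Besides using the valuable domain knowledge given in the other answer to determine the optimal number of states, and as an alternative to AIC or BIC, we can consider Bayesian estimation of the parameters, including the optimal number of states:

Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. J. Amer. Statist. Assoc. 97, 337–351.

Congdon, P. (2006). Bayesian model choice based on Monte Carlo estimates of posterior model probabilities. Comput. Statist. Data Anal. 50, 346–357.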
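
A much cruder stand-in for posterior model probabilities, and not the method of either paper, is the standard large-sample approximation in which, under equal prior weights on the candidate models, the posterior probability of each model is roughly proportional to exp(-BIC/2). The sketch below turns a vector of BIC values into such approximate weights; the function name `bic_weights` is hypothetical.

```r
# Approximate posterior model probabilities from BIC values,
# assuming equal prior probability for every candidate model.
bic_weights <- function(bic) {
  delta <- bic - min(bic)  # shift by the minimum for numerical stability
  w <- exp(-0.5 * delta)
  w / sum(w)
}
```

If no single candidate receives most of the weight, the data do not clearly distinguish between the candidate numbers of states, which is useful to know before committing to one.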

Source: https://quant.stackexchange.com/questions/11045