Build a predictive model for movie rating prediction

    Question

  • Hi.

    I downloaded data from the IMDb database (movies, actors, ratings, directors, etc.) and now I'm building a model for movie rating prediction. But I'm not sure how to build the model, or how to determine which algorithm is right for my data. I know it should be regression, but I don't know which kind exactly. So I tried a few of them, but the results are not good (so my approach is definitely not right :) ).

    I started with only the ratings dataset, where prediction takes about 2 minutes. When I tried to add another dataset with actors to refine the results, it took more than 11 hours. So I definitely need a more sophisticated approach to choosing the algorithm and the data :)

    Is there any systematic method to determine which regression algorithm I should use? How should I proceed in this case?

    Thank you for your help.

    Tuesday, February 17, 2015 9:40 AM

Answers

  • Now that I have all the datasets converted into the recommended form, where each actor, director, etc. is replaced by the average rating of the movies he's appeared in, and the movie release year is transformed into a comparable, sortable numeric value, I should save the results as new datasets and use them for the movie rating prediction instead of the original datasets, right?

    So I tried to build a new model with these new datasets.

    And the result looks MUCH better I think :)

    But now I have in my head a few improvements.

    1. As Yordan wrote above, how can I use the year to improve the prediction?
    2. I have the number of votes available for each movie. Few votes (e.g. 5) means the rating is less credible, and many votes (e.g. 50,000) means it is more credible. How can I include this in the prediction? (a sketch follows below)
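
    One standard way to handle this (not from this thread; a common Bayesian-average trick) is to shrink ratings with few votes toward the global mean. A minimal Python sketch, with the prior strength m picked by hand:

        import pandas as pd

        # Ratings backed by few votes are pulled toward the global mean C;
        # m controls how many votes it takes to earn "full trust".
        def weighted_rating(rank, votes, C, m=100):
            return (votes / (votes + m)) * rank + (m / (votes + m)) * C

        movies = pd.DataFrame({"rank": [9.2, 6.0], "votes": [50000, 5]})
        C = movies["rank"].mean()
        movies["wr"] = weighted_rating(movies["rank"], movies["votes"], C)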

    Thursday, March 05, 2015 11:13 AM

All replies

  • Very interesting scenario. What algorithm are you currently using that is taking 11 hours?

    Vishal Narayan Saxena http://twitter.com/vishalishere http://www.ogleogle.com/vishal/

    Tuesday, February 17, 2015 1:31 PM
  • Boosted Decision Trees Regression.

    So far I have just tried all the algorithms with only the ratings dataset. The runtime was approximately the same for all of them, but the best results came from Boosted Decision Trees Regression and Fast Forest Quantile Regression (Quantile Loss: 0.750), according to the Evaluate Model module (Mean Absolute Error and Root Mean Squared Error).

    Still, I don't know how to determine which algorithm to use, and which datasets and features to use...
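
    For reference, the "try several regressors and compare the errors" loop that Evaluate Model automates can be sketched in plain Python (an illustration assuming scikit-learn, not the actual Azure ML modules):

        from sklearn.datasets import make_regression
        from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import cross_val_score

        # Toy stand-in for the ratings dataset.
        X, y = make_regression(n_samples=1000, n_features=10, noise=5.0)

        # Cross-validated MAE for a handful of candidate regressors.
        for model in (LinearRegression(),
                      RandomForestRegressor(n_estimators=100),
                      GradientBoostingRegressor()):
            mae = -cross_val_score(model, X, y, cv=5,
                                   scoring="neg_mean_absolute_error").mean()
            print(type(model).__name__, round(mae, 2))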

    Tuesday, February 17, 2015 1:47 PM
  • You could try the Matchbox Recommender modules; they're designed for this kind of problem.

    Here's a video tutorial.

    Hope this helps, Roope

    Tuesday, February 17, 2015 4:59 PM
    Moderator
  • I know, but I probably cannot use it, because it's designed for user-item-rating triples, and I don't have a dataset of users and their ratings available. I only have a dataset with movie ID, rank and number of votes.

    Or how could I use it with the Matchbox recommender?

    Thank you.


    Tuesday, February 17, 2015 5:38 PM
  • Got it. In this case you could use the Ordinal Regression module, which is basically a module that turns a binary classifier into a rating predictor.

    Note that you probably want to start with a small training set when using Ordinal Regression. It works by training N binary classifiers, one for each possible rating value, each making the prediction "Is the rating of movie X at least Y?"
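
    To make that concrete, here is a minimal sketch of the "N binary classifiers" idea (an illustration assuming scikit-learn and integer ratings, not the module's actual implementation):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        class ThresholdOrdinalRegression:
            def fit(self, X, y):
                # One binary classifier per threshold t: "is the rating >= t?"
                self.min_ = y.min()
                self.thresholds_ = np.unique(y)[1:]
                self.models_ = {t: LogisticRegression().fit(X, (y >= t).astype(int))
                                for t in self.thresholds_}
                return self

            def predict(self, X):
                # Summing P(rating >= t) over all thresholds estimates how many
                # thresholds a movie clears; adding the minimum rating maps the
                # count back onto the original rating scale.
                p = sum(m.predict_proba(X)[:, 1] for m in self.models_.values())
                return np.rint(p) + self.min_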

    Tuesday, February 17, 2015 5:46 PM
    Moderator
  • I tried it. It ran for 11 hours, and the result is really bad: just one value for all rows... I guess I did something wrong :)

    Wednesday, February 18, 2015 7:26 AM
  • I'm guessing that the SVM + Ordinal Regression is just snapping to the average rating of all movies.

    Perhaps this is a problem of constructing a good set of features. Could you describe the movies dataset in more detail? You mentioned it has features like actors and directors. How many unique values are there, and how are they represented in the dataset?

    Thanks, Roope

    Thursday, February 19, 2015 1:09 AM
    Moderator
  • Ok, no problem. I know that machine learning and prediction are mainly about choosing an appropriate set of features. So thank you very much for your interest.

    I have these datasets available:

    The basic (I think default) dataset is ratings, which contains the columns movieid, rank and number of votes.

    Dataset movies contains movieid, title (name of the movie) and year (release year). I use this dataset only for connecting the ID of a movie with its name.

    Dataset actors contains actorid, name of the actor and sex.

    Dataset movies2actors contains movieid, actorid and as_character (which character he/she plays in the movie, e.g. "man in car", "Chico", ...). This dataset connects actors to the movies they appear in. Like actorid 1, movieid 5, as_character man in car; actorid 1, movieid 16, as_character himself; actorid 3, movieid 87, as_character Chico; ...

    Dataset directors contains directorid and name of the director. It is just a database of directors.

    Dataset movies2directors contains movieid and directorid. Connects movies with directors. (movieid 1, directorid 5; movieid 86, directorid 98; movieid 61, directorid 5)

    Dataset composers contains composerid and name of the composer.

    Dataset movies2composers contains movieid and composerid. The principle is the same as movies2directors.

    Dataset costdesigners contains costdesid and name of the costume designer.

    Dataset movies2costdes contains movieid and costdesid. Principle is the same as above.

    Dataset countries contains movieid and the country where the movie was shot.

    Dataset distributors contains movieid and name of the distributor.

    Dataset editors contains editorid and name of the editor.

    Dataset movies2editors contains movieid and editorid.

    Dataset genres contains movieid and genre of the movie (like Horror, Drama, Comedy, ...).

    Dataset language contains movieid and original language of the movie (English, Italian, Czech, ...).

    Dataset locations contains movieid and the location where the movie was shot (movieid 2, location New York City, New York, USA; movieid 21, location Spiderhouse Cafe, Austin, Texas, USA; movieid 21, location Barton Springs, Austin, Texas, USA; ...).

    Dataset prodcompanies contains movieid and name of the production company. (movieid 350, name Warner Bros. Television [us])

    Dataset producers contains producerid and name of the producer.

    Dataset movies2producers contains movieid and producerid.

    Dataset writers contains writerid and name of the writer.

    Dataset movies2writers contains movieid and writerid.

    Hope it's understandable :)

    Thursday, February 19, 2015 8:49 AM
  • First, try removing the ordinal regression to see if that's what is causing the perf issue. Just go for standard regression on the ratings to begin with.

    Another issue might be that you end up with a large number of (sparse) features after you do the join. Note that we convert string features into indicator arrays by default. So if a string feature (like basically any name) contains a hundred thousand different values, you'll end up with a hundred thousand different features. To cope with this you need to either get rid of the string features or apply some feature hashing.
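
    As an aside, the feature hashing idea can be sketched like this (an illustration assuming scikit-learn, not the Azure ML module):

        from sklearn.feature_extraction import FeatureHasher

        rows = [{"actor": "Robert De Niro", "director": "Martin Scorsese"},
                {"actor": "Al Pacino", "director": "Sidney Lumet"}]

        # Hash each "column=value" string into one of 4096 buckets instead of
        # creating one indicator column per unique name.
        hasher = FeatureHasher(n_features=2**12, input_type="dict")
        X = hasher.transform(rows)   # sparse matrix of shape (2, 4096)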

    -Y-

    Thursday, February 19, 2015 2:54 PM
  • Hi Yordan.

    Thanks for your reply.

    When I replaced the ordinal regression with linear regression, it took less than two minutes, with the following result.

    Uff, ok. I'm not very well versed in machine learning; I'm just a beginner. Please, do you have any suggestions for me?

    //EDIT: Just to be sure, I tried running the prediction with SVM + ordinal regression again, and now it took only two minutes. Maybe because I ran it again (and the last result is cached or something)? Or because ML is not in preview anymore?

    Friday, February 20, 2015 8:51 AM
  • Hi.

    I tried using the Two-Class Averaged Perceptron with the ordinal regression, and now the result is much better.

    And the evaluation:

    Could it be better? :) Or how can I improve my prediction?

    Tuesday, February 24, 2015 8:51 AM
  • Hi Lukas,

    The label column has 100 categories; that's why the ordinal regression is slow. I'd rather stick with regular regression in this case.

    I don't quite understand why the table that you show has so few features. What happened to all the rest? I thought they were joined in... There is very little variation in the predicted labels, which indicates insufficient feature information.

    What's the type of the Year of release ("Rok vydani") column? It looks like a string to me, so note that it will get converted to indicator arrays (that is, a separate feature for each value). In this case you have to make sure that you have enough instances for each year of release, because you no longer take advantage of the fact that the feature values are ordered. That is, the model can no longer see a movie as "old" or "new" (since the feature is not numeric).
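
    Converting the column to a numeric feature restores that ordering; a minimal sketch assuming pandas and the column name above:

        import pandas as pd

        df = pd.DataFrame({"Rok vydani": ["1994", "2003", "1978"]})

        # Parse the string years into numbers (unparseable values become NaN).
        df["year"] = pd.to_numeric(df["Rok vydani"], errors="coerce")

        # Optionally express it as the age of the movie; this keeps the
        # ordering and gives the model a magnitude it can threshold on.
        df["movie_age"] = 2015 - df["year"]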

    -Y-

    Wednesday, February 25, 2015 2:35 PM
  • Hi Yordan.

  • I didn't add other features (datasets) to the prediction because I'm not sure which features are suitable and which are not. And when I tried to add actors, the prediction took more than 11 hours, and that was just with actors... So I think I need to select the suitable features first and then add them to the model. And that's the problem, because I don't know which features are suitable and which are not.

    Ok, good remark about the Year of release. Maybe I can convert it to a number. But in my opinion, the year of release is not useful at all, so I can probably remove it. Or not?

    Wednesday, February 25, 2015 3:46 PM
  • I think year of release is an indicative feature, and although it might be a weak signal, it probably contributes to the label value (maybe more in conjunction with other features; note that the early Robert De Niro movies are wildly different from his late ones). But predicting a rating purely from this feature is rather a no-go.

    The process of feature engineering is usually fairly involved and time-consuming, so I'm not sure we have the bandwidth to help with this (if anyone disagrees, please jump in). I'd suggest increasing the complexity of the feature set step by step until you find the right balance of training time versus accuracy. For example, try adding country of origin and distributor (I assume these don't have that many unique values). Then add the director, then one writer, then one actor, then all actors, and so on.

    -Y-

    Wednesday, February 25, 2015 5:17 PM
  • Ok, thank you very much for your help.

    I will add the year of release back once I fix it and convert it to numbers :)

    For now, I tried changing the ordinal regression to boosted decision trees regression, and I've added the actors and directors datasets.

    I don't know why, but now it took only about 5 minutes. And the result is

    Wednesday, February 25, 2015 6:08 PM
  • If you have a table of movie ratings vs. actors available, one trick is to replace each actor by the average rating of the movies he's appeared in. Basically, a movie that has known good actors has a better chance of getting a good rating. This also avoids the problem of dealing with high-dimensional string features. A row in your training set would then look like

    movie rating, average rating of actor 1, average rating of actor 2, ...

    You should be able to do this kind of transformation using the Replace Discrete Values module, or with the Python, R or SQL script modules.

    Hope this helps,

    Roope

    Wednesday, February 25, 2015 6:13 PM
    Moderator
  • Ok, that sounds great. But I'm a little bit lost... I've created a new experiment for this.

    The Join is an inner join, the projected columns are movieid, rank and actorid. The Metadata Editor changes rank and actorid to categorical. Replace Discrete Values has the discrete columns set to actorid and the replacement columns set to rank. This is the result

    What did I do wrong? And how do I make it like you said, with columns for average actor ratings? It would need a different kind of join, one that adds not a new row for each instance, but a new column to the movie.

    Thursday, February 26, 2015 6:59 PM
  • It seems that the Replace Discrete Values module is not the best tool for the job after all. But by using the Apply SQL Transformation module in a few steps, you should be able to massage the dataset into the desired format:

    1) Start from the table you created that has movieid, rating and actorid. Split it into training and test sets; for example, all movies before year X go into the training set.

    2) Apply SQL Transformation to the training half of the dataset: "select actorid, AVG(rank) as actor_avg from t1 group by actorid;" This computes the average for each actor.

    3) Use a Join module with a left outer join between the movies2actors table and the new "actorid, actor_avg" table, with actorid as the join key.

    4) Use a Clean Missing Values module to replace missing values of actor_avg, as it is possible that some actors don't appear in the training half of the dataset. For example, the average rating over all movies could work as the missing value replacement.

    5) Another Apply SQL Transformation to create feature columns that represent the quality of the actors that appear in a given movie. For example: "select movieid, MAX(actor_avg) as best_actor, AVG(actor_avg) as avg_actor, MIN(actor_avg) as worst_actor from t1 group by movieid;"

    The result should be a dataset with the columns "movieid, best_actor, avg_actor, worst_actor". You can use different aggregations to create different feature vectors, and you can apply the same procedure to director or producer quality too.

    Note that step 2 must be done using the training dataset only; otherwise information about the ratings you're trying to predict will leak into the test set, and the evaluation metrics will look too good to be true.
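
    For reference, the same pipeline can be sketched outside the designer, e.g. with pandas (an illustration only; the table m2a with columns movieid, rank, actorid and year is an assumption following the steps above):

        import pandas as pd

        # 1) time-based split, so no test-set ratings influence the features
        train = m2a[m2a["year"] < 1990]
        test = m2a[m2a["year"] >= 1990]

        # 2) per-actor average rating, computed on the training half only
        actor_avg = train.groupby("actorid")["rank"].mean().rename("actor_avg")

        # 3) + 4) left-join the averages back, filling actors unseen in
        # training with the global average rating
        def add_actor_avg(df):
            out = df.join(actor_avg, on="actorid")
            return out.fillna({"actor_avg": train["rank"].mean()})

        # 5) aggregate per movie into best/avg/worst actor-quality features
        features = (add_actor_avg(train)
                    .groupby("movieid")["actor_avg"]
                    .agg(best_actor="max", avg_actor="mean", worst_actor="min"))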

    It's quite an interesting case of feature engineering. Hope this helps,

    Roope

    Friday, February 27, 2015 12:19 AM
    Moderator
  • Thank you for the detailed step-by-step guide.

    The result looks good. I've done it for the directors and actors. When transforming directors it's not necessary to have avg, best and worst ratings, because a movie has only one director, so I've changed the SQL transformation accordingly. And I'm removing the entire row in the Clean Missing Data module.

    So do you think this approach is good for movie rating prediction? Or could it be better?

    What did you mean by the rating leaking into the test set? I'm splitting the dataset into training and testing parts.

    Saturday, February 28, 2015 8:35 AM
  • It's probably best to experiment with different features and see what kind of results the model gives.

    About the target leakage: if you split the data at the very beginning and ensure that there are no connections going from the test part to the training part, you should be OK. The problem arises if a movie is used to compute actor_avg and is then also used in the test dataset, because then the actor_avg feature was computed using the very value the model tries to predict.

    Hope that explains it,

    Roope

    Sunday, March 01, 2015 12:01 AM
    Moderator
  • Ok, thanks for the explanation.

    And how do I correctly predict with multiple datasets? Just with Join modules, joining the two datasets from the right and the ratings dataset from the left? (the picture below is only an example)

    What is the limit on rows and columns for the Join module? When I tried to join almost all of the datasets earlier, I got the error "Error 0000: Internal error" with the status detail "Process exited with error code -2" and the following output log:

    Record Starts at UTC 03/01/2015 10:37:52:
    
    Run the job:"/dll "Microsoft.Analytics.Modules.Join.Dll, Version=6.0.0.0, Culture=neutral, PublicKeyToken=69c3241e6f0468ca;Microsoft.Analytics.Modules.Join.Dll.Join;Run" /Output0 "..\..\Results dataset\Results dataset.dataset" /table1 "..\..\Dataset1\Dataset1.dataset" /table2 "..\..\Dataset2\Dataset2.dataset" /keys1 "%7B%22isFilter%22%3Atrue%2C%22rules%22%3A%5B%7B%22ruleType%22%3A%22ColumnNames%22%2C%22columns%22%3A%5B%22Movie%20ID%22%5D%2C%22exclude%22%3Afalse%7D%5D%7D" /keys2 "%7B%22isFilter%22%3Atrue%2C%22rules%22%3A%5B%7B%22ruleType%22%3A%22ColumnNames%22%2C%22columns%22%3A%5B%22Movie%20ID%22%5D%2C%22exclude%22%3Afalse%7D%5D%7D" /caseSensitive "True" /joinType "Inner Join" /keep2 "False" "
    Starting Process 'C:\Resources\directory\4ab9abba3c514c43a572dd4779ca11e5.SingleNodeRuntimeCompute.Packages\AFx\6.2\DllModuleHost.exe' with arguments ' /dll "Microsoft.Analytics.Modules.Join.Dll, Version=6.0.0.0, Culture=neutral, PublicKeyToken=69c3241e6f0468ca;Microsoft.Analytics.Modules.Join.Dll.Join;Run" /Output0 "..\..\Results dataset\Results dataset.dataset" /table1 "..\..\Dataset1\Dataset1.dataset" /table2 "..\..\Dataset2\Dataset2.dataset" /keys1 "%7B%22isFilter%22%3Atrue%2C%22rules%22%3A%5B%7B%22ruleType%22%3A%22ColumnNames%22%2C%22columns%22%3A%5B%22Movie%20ID%22%5D%2C%22exclude%22%3Afalse%7D%5D%7D" /keys2 "%7B%22isFilter%22%3Atrue%2C%22rules%22%3A%5B%7B%22ruleType%22%3A%22ColumnNames%22%2C%22columns%22%3A%5B%22Movie%20ID%22%5D%2C%22exclude%22%3Afalse%7D%5D%7D" /caseSensitive "True" /joinType "Inner Join" /keep2 "False" '
    [ModuleOutput] DllModuleHost Start: 1 : Program::Main
    [ModuleOutput]   DllModuleHost Start: 1 : DataLabModuleDescriptionParser::ParseModuleDescriptionString
    [ModuleOutput]   DllModuleHost Stop: 1 : DataLabModuleDescriptionParser::ParseModuleDescriptionString. Duration: 00:00:00.0050351
    [ModuleOutput]   DllModuleHost Start: 1 : DllModuleMethod::DllModuleMethod
    [ModuleOutput]   DllModuleHost Stop: 1 : DllModuleMethod::DllModuleMethod. Duration: 00:00:00.0000598
    [ModuleOutput]   DllModuleHost Start: 1 : DllModuleMethod::Execute
    [ModuleOutput]     DllModuleHost Start: 1 : DataLabModuleBinder::BindModuleMethod
    [ModuleOutput]       DllModuleHost Verbose: 1 : moduleMethodDescription Microsoft.Analytics.Modules.Join.Dll, Version=6.0.0.0, Culture=neutral, PublicKeyToken=69c3241e6f0468ca;Microsoft.Analytics.Modules.Join.Dll.Join;Run
    [ModuleOutput]       DllModuleHost Verbose: 1 : assemblyFullName Microsoft.Analytics.Modules.Join.Dll, Version=6.0.0.0, Culture=neutral, PublicKeyToken=69c3241e6f0468ca
    [ModuleOutput]       DllModuleHost Start: 1 : DataLabModuleBinder::LoadModuleAssembly
    [ModuleOutput]         DllModuleHost Verbose: 1 : Trying to resolve assembly : Microsoft.Analytics.Modules.Join.Dll, Version=6.0.0.0, Culture=neutral, PublicKeyToken=69c3241e6f0468ca
    [ModuleOutput]         DllModuleHost Verbose: 1 : Loaded moduleAssembly Microsoft.Analytics.Modules.Join.Dll, Version=6.0.0.0, Culture=neutral, PublicKeyToken=69c3241e6f0468ca
    [ModuleOutput]       DllModuleHost Stop: 1 : DataLabModuleBinder::LoadModuleAssembly. Duration: 00:00:00.0075148
    [ModuleOutput]       DllModuleHost Verbose: 1 : moduleTypeName Microsoft.Analytics.Modules.Join.Dll.Join
    [ModuleOutput]       DllModuleHost Verbose: 1 : moduleMethodName Run
    [ModuleOutput]       DllModuleHost Information: 1 : Module FriendlyName : Join
    [ModuleOutput]       DllModuleHost Information: 1 : Module Release Status : Release
    [ModuleOutput]     DllModuleHost Stop: 1 : DataLabModuleBinder::BindModuleMethod. Duration: 00:00:00.0114818
    [ModuleOutput]     DllModuleHost Start: 1 : ParameterArgumentBinder::InitializeParameterValues
    [ModuleOutput]       DllModuleHost Verbose: 1 : parameterInfos count = 7
    [ModuleOutput]       DllModuleHost Verbose: 1 : parameterInfos[0] name = table1 , type = Microsoft.Numerics.Data.Local.DataTable
    [ModuleOutput]       DllModuleHost Start: 1 : DataTableDatasetHandler::HandleArgumentString
    [ModuleOutput]       DllModuleHost Stop: 1 : DataTableDatasetHandler::HandleArgumentString. Duration: 00:00:18.9288147
    [ModuleOutput]       DllModuleHost Verbose: 1 : parameterInfos[1] name = table2 , type = Microsoft.Numerics.Data.Local.DataTable
    [ModuleOutput]       DllModuleHost Start: 1 : DataTableDatasetHandler::HandleArgumentString
    [ModuleOutput]       DllModuleHost Stop: 1 : DataTableDatasetHandler::HandleArgumentString. Duration: 00:00:13.1633739
    [ModuleOutput]       DllModuleHost Verbose: 1 : parameterInfos[2] name = keys1 , type = Microsoft.Analytics.Modules.Common.Dll.ColumnSelection
    [ModuleOutput]       DllModuleHost Verbose: 1 : parameterInfos[3] name = keys2 , type = Microsoft.Analytics.Modules.Common.Dll.ColumnSelection
    [ModuleOutput]       DllModuleHost Verbose: 1 : parameterInfos[4] name = caseSensitive , type = System.Boolean
    [ModuleOutput]       DllModuleHost Verbose: 1 : Converted string 'True' to value of type System.Boolean
    [ModuleOutput]       DllModuleHost Verbose: 1 : parameterInfos[5] name = joinType , type = Microsoft.Analytics.Modules.Join.Dll.Join+JoinType
    [ModuleOutput]       DllModuleHost Verbose: 1 : Converted string 'Inner Join' to enum of type Microsoft.Analytics.Modules.Join.Dll.Join+JoinType
    [ModuleOutput]       DllModuleHost Verbose: 1 : parameterInfos[6] name = keep2 , type = System.Boolean
    [ModuleOutput]       DllModuleHost Verbose: 1 : Converted string 'False' to value of type System.Boolean
    [ModuleOutput]     DllModuleHost Stop: 1 : ParameterArgumentBinder::InitializeParameterValues. Duration: 00:00:32.2560248
    [ModuleOutput]     DllModuleHost Verbose: 1 : Begin invoking method Run ... 
    [ModuleOutput] InputDataStructure
    [ModuleOutput] 
    [ModuleOutput] {
    [ModuleOutput] 	"InputName":Dataset1
    [ModuleOutput] 	"Rows":22816683
    [ModuleOutput] 	"Cols":4
    [ModuleOutput] 	"ColumnTypes":System.Int32,3,System.String,1
    [ModuleOutput] }
    [ModuleOutput] InputDataStructure
    [ModuleOutput] 
    [ModuleOutput] {
    [ModuleOutput] 	"InputName":Dataset2
    [ModuleOutput] 	"Rows":11201249
    [ModuleOutput] 	"Cols":6
    [ModuleOutput] 	"ColumnTypes":System.Int32,2,System.String,4
    [ModuleOutput] }
    [ModuleOutput]   DllModuleHost Stop: 1 : DllModuleMethod::Execute. Duration: 00:01:29.3937713
    [ModuleOutput]   DllModuleHost Error: 1 : Program::Main encountered fatal exception: Microsoft.Analytics.Exceptions.ErrorMapping+ModuleException: Error 0000: Internal error ---> System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.OutOfMemoryException: Array dimensions exceeded supported range.
    Module finished after a runtime of 00:01:29.9721463 with exit code -2
    Module failed due to negative exit code of -2
    
    Record Ends at UTC 03/01/2015 10:39:30.
    
    
    

    Sunday, March 01, 2015 11:02 AM
  • This line in the error message indicates the nature of the error:

    System.OutOfMemoryException: Array dimensions exceeded supported range

    The dataset probably grew too large at some point during the joining. You could try sampling the dataset down to a smaller set to begin with, and then increasing it incrementally. You could also try removing any columns you're not using, right at the beginning.
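
    A sketch of that advice in pandas (an illustration; the ratings and movies2actors tables and the 10% sampling rate are assumptions):

        # Project away unused columns first, then sample rows, so the join
        # has far less data to materialize.
        ratings_small = (ratings[["movieid", "rank"]]
                         .sample(frac=0.1, random_state=0))
        joined = ratings_small.merge(movies2actors[["movieid", "actorid"]],
                                     on="movieid", how="inner")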

    Hope this helps,

    Roope

    Monday, March 02, 2015 2:45 PM
    Moderator
  • I found it. But what is the limit? Is there a specific maximum size (dimensions, number of values, number of columns, ...) for a dataset?

    And am I doing it right when I try to add more datasets to improve the prediction, as I showed in the screenshot above?

    Thank you.

    Monday, March 02, 2015 2:51 PM
  • Most likely a .NET array limitation: int.MaxValue (~2.15 billion) is the limit for the size of a .NET array.

    I'll file a defect, as we expect to support a 10 GB dataset in an end-to-end workflow, though of course a Join can bloat the size by quite a bit.

    AK

    Monday, March 02, 2015 4:10 PM
    Moderator
  • Hey Lukas,

    Do you mind sharing the datasets and module parameters you used for the Join module that ran out of memory? We'd like to ensure that we've identified the proper error. If so, please contact me at amlforum [at] microsoft [dot] com and we can take this offline.

    Thanks!

    AK


    Monday, March 02, 2015 8:22 PM
    Moderator
  • Hi AK.

    I sent you an email.

    Thx

    Tuesday, March 03, 2015 7:40 AM