fmow_baseline | Reply
The fmow_baseline code can be found here: https://github.com/fmow/baseline
Re: fmow_baseline (response to post by kbowerma) | Reply
Thanks for sharing; your methods and results are very good. Are you able to say what score the CNN-with-metadata model gets?
Re: fmow_baseline (response to post by yellowsubmarine) | Reply
Thank you. Detailed results can be found in the paper, which should go live on arXiv early next week. We'll update this thread when that happens.

Also, we should note that submissions from this account are not eligible for any prizes. We submitted to raise awareness of the baseline and its performance. We hope that the ideas and code presented in the baseline help participants achieve higher levels of performance on this difficult problem.
Re: fmow_baseline (response to post by fMoW_baseline) | Reply
Thanks!

How many epochs did it take for the submitted score? What GPUs did you use?

The baseline code uses a batch size of 128 for the CNN, which I don't think would even fit in the K80's GPU memory. I understand the models we provide need to be able to complete everything (preprocessing, training, and inference) in 7 days max using either an AWS p2.xlarge (GPU) or an m4.10xlarge (CPU).
Re: fmow_baseline (response to post by antor) | Reply
Pre-trained models are available here: https://github.com/fMoW/baseline/releases/tag/paper

We trained the CNN for 36 epochs at a learning rate of 1e-4, and we trained the LSTM for 100 epochs at a learning rate of 1e-4 followed by 25 epochs at a learning rate of 1e-5. We used a machine with four P100-SXM2 GPUs, which allowed us to set larger batch sizes.
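
For anyone trying to reproduce that schedule, here is a rough sketch in Keras 2.x; build_cnn_model, build_lstm_model, the generators/step counts, and the choice of Adam are placeholders and assumptions, not the baseline's actual names:

from keras import backend as K
from keras.optimizers import Adam

# CNN: 36 epochs at a fixed learning rate of 1e-4
cnn = build_cnn_model()
cnn.compile(optimizer=Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
cnn.fit_generator(cnn_train_gen, steps_per_epoch=cnn_steps, epochs=36)

# LSTM: 100 epochs at 1e-4, then drop the learning rate and run 25 more
lstm = build_lstm_model()
lstm.compile(optimizer=Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
lstm.fit_generator(lstm_train_gen, steps_per_epoch=lstm_steps, epochs=100)
K.set_value(lstm.optimizer.lr, 1e-5)  # lower the rate in place, keeping optimizer state
lstm.fit_generator(lstm_train_gen, steps_per_epoch=lstm_steps, epochs=25)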
Re: fmow_baseline (response to post by fMoW_baseline) | Reply
Very nice.

Do you have any plans to make a submission/score with your training regime subjected to the constraints of the final stage (max training of 7 days on a single K80-level GPU)?

Can those pre-trained models be downloaded by our code and used in our own submissions?
Re: fmow_baseline (response to post by antor) | Reply
We do not currently have any plans to retrain using a K80-level GPU. Changing the batch size to fit on a smaller GPU shouldn't have much of an effect on the final score. Have you tested to see if training would take longer than 7 days?

At any rate, we believe the rules allow for using pre-trained models such as the ones we have released. See the General Notes section within the Problem Statement.
Re: fmow_baseline (response to post by fMoW_baseline) | Reply
I benchmarked K80 training (on the AWS p2.xlarge provided for the intermediary progress prize) using your baseline codebase with a less complex feature extractor than DenseNet, and it takes 38,600 s (~11 hours) per epoch. Since 7 days is ~168 hours, that setup leaves only ~14-16 epochs to train the CNN (I didn't try the LSTM). DenseNet will take even longer because of its smaller batch size.
Re: fmow_baseline (response to post by antor) | Reply
I just ran a couple epochs training DenseNet on a K40 with batch size of 26 and saw ~25% faster speeds per epoch than what you mention. Perhaps there are more tricks that can be employed to get a larger batch onto the GPU. It may also be worth replying to walrus71 in this thread: https://apps.topcoder.com/forums/?module=Thread&threadID=906108&start=0

In other news, our paper went live on arXiv last night. Please see https://arxiv.org/abs/1711.07846 for more details about the data and baseline performance.
Re: fmow_baseline (response to post by fMoW_baseline) | Reply
Thank you for kindly sharing your baseline code. If you have a chance to reply, I have a couple of questions about it:

1) The variable self.params.cnn_seq2seq_layer_length doesn't exist in the (baseline-paper) code, so I assume that cnn_lstm_layer_length = params.cnn_seq2seq_layer_length = 2208. Is this correct, or am I doing something wrong?

2) I'm training your code on my machine, but the models I produce come out larger than the ones you released:

Models released:
204M cnn_image_and_metadata.model
726M lstm_image_and_metadata.model

Models trained on my machine (GeForce GTX, also using Keras):
606M cnn_image_and_metadata.model
2.2G lstm_image_and_metadata.model

Do you know if this is due to the GPU/Keras setup or DenseNet, or whether there is something different in the released code?

Thank you very much.
Re: fmow_baseline (response to post by usf_bulls) | Reply
1) You're correct, good catch! Should be fixed now.
2) Did you use make_parallel during training? We had similarly sized models saved by ModelCheckpoint (cnn_image_and_metadata.hdf5 - 605MB, lstm_image_and_metadata.hdf5 - 2.12GB). We removed the parallelization (e.g., https://github.com/kuza55/keras-extras/issues/3#issuecomment-278125199) and re-saved the models before uploading to GitHub.
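
One way to avoid the oversized checkpoints in the first place is to keep a handle on the single-GPU model and save that instead of the wrapped one; a rough sketch (build_model and the names here are illustrative, not the baseline's API; the serial and parallel models share weights):

serial_model = build_model()                        # single-GPU graph
parallel_model = make_parallel(serial_model, 4)     # kuza55-style multi-GPU wrapper
parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')
parallel_model.fit_generator(train_gen, steps_per_epoch=steps, epochs=num_epochs)
serial_model.save('cnn_image_and_metadata.model')   # small single-GPU file, weights already updated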

Hope that helps ease your doubts :)
Re: fmow_baseline (response to post by kbowerma) | Reply
Do you have any idea how long it took to generate the CNN codes? Currently with our model (which uses make_parallel), it seems like it will take around 5-6 days, which seems awfully long for this.
Re: fmow_baseline (response to post by Ritwik_G) | Reply
On further analysis, I realized that this was largely due to GPU memory thrashing when trying to do this in parallel. By loading the model on one GPU only, and then performing all inference on that GPU, the ETA is now down to 24 hours for the whole task.
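
For anyone hitting the same thing, the fix boils down to something like this (environment-variable approach; the GPU index and filename are just examples):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'    # expose a single GPU before TF/Keras initializes
from keras.models import load_model

model = load_model('cnn_image_and_metadata.model')
cnn_codes = model.predict(img_batch)        # all inference now stays on that one GPU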
Re: fmow_baseline (response to post by kbowerma) | Reply
Hey, I don't think that the LSTM batch helper implementation is correct as provided on the master branch.

1) Nowhere in _load_lstm_batch_helper is the metadata loaded or appended to the CNN codes.
2) inputDict['lastLayerLength'] will be 2253 if metadata is set to True (params.cnn_lstm_layer_length + params.metadata_length), but later on you try to assign cnnCodes of length 4096 there, leading to an error, specifically: ValueError: could not broadcast input array from shape (4096) into shape (2253). (See the minimal repro after this list.)
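
Boiled down to the shapes involved, the failure looks like this (numpy only; 45 here is just 2253 - 2208, i.e. params.metadata_length):

import numpy as np

last_layer_length = 2208 + 45                  # cnn_lstm_layer_length + metadata_length = 2253
batch = np.zeros((8, 20, last_layer_length))   # (batch, timesteps, features); sizes illustrative
cnn_codes = np.zeros(4096)                     # codes generated at the wrong length
batch[0, 0, :] = cnn_codes                     # ValueError: could not broadcast (4096) into (2253)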

Can you verify that the LSTM code on the master branch works as intended?

Ritwik
Re: fmow_baseline (response to post by Ritwik_G) | Reply
Apologies for the confusion. I somehow replied under another account. Repeating my response/question here in case that gets deleted:

1) The metadata is appended as part of the codes generation.
2) Are you saying that in line 247 of mlFunctions.py -- cnnCodes = json.load(open(currData['cnn_codes_paths'][codesIndex])) -- it is loading codes of length 4096?
Re: fmow_baseline (response to post by fMoW_baseline) | Reply
Hey!

It turns out it was all a misunderstanding.

There was some confusion between the comments in the code and the actual intent behind the code extraction. data_ml_functions/mlFunctions.py#L175 says "Custom generator that yields a vector containing the 4096-d CNN codes output by ResNet50 and metadata features...", which led me to generate codes that were supposed to be 4096-d, so I edited the method to reflect that.

I think the comment is a vestige from the old baseline, and the intent was to create 2208-d codes + metadata. This was what was causing the error, and once I fixed that, it all worked.
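
In code terms, my corrected understanding of the codes generation is roughly (names are mine, not the baseline's):

import numpy as np

features = feature_extractor.predict(img_batch)         # (n, 2208) DenseNet features
codes = np.concatenate([features, metadata], axis=1)    # (n, 2253): 2208-d codes + 45-d metadata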

Would you be able to push a commit that updates that comment?
Thanks! And sorry again for the confusion. Since I'm using make_parallel, I'm having to edit a lot of the code, and I got caught up in the comments instead of the single-GPU model summary.
Re: fmow_baseline (response to post by Ritwik_G) | Reply
Ah, great catch! We didn't update the comments. Apologies for the confusion. We'll update those soon. Thanks!

As for using make_parallel, there shouldn't be many changes required. We changed very few lines to remove that.
Re: fmow_baseline (response to post by fMoW_baseline) | Reply
Hi guys,

could you please specify where the improved performance (compared to the first baseline) comes from?
Re: fmow_baseline (response to post by Mloody2000) | Reply
The following diff shows most of the important changes: https://github.com/fMoW/baseline/commit/17d86046be202f46f6e6604f0ca47652770315ab.

We believe better handling of spatial context, metadata features, and using DenseNet as our feature extractor were the primary factors that improved performance.
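
For anyone curious what the feature-extractor part looks like in Keras, a generic sketch (recent Keras versions ship DenseNet121/169/201 in keras.applications; the 2208-d codes discussed in this thread correspond to a DenseNet-161, which is not in keras.applications, so treat this as the pattern rather than our exact model):

from keras.applications.densenet import DenseNet121, preprocess_input

extractor = DenseNet121(include_top=False, weights='imagenet', pooling='avg')
features = extractor.predict(preprocess_input(img_batch))   # (n, 1024) for DenseNet121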
Re: fmow_baseline (response to post by fMoW_baseline) | Reply
Would it be possible to publish the 3 files

data/working/dataset_stats.json
data/working/cnn_codes_stats_no_metadata.json
data/working/cnn_codes_stats_with_metadata.json

used by the pre-trained models?
I think they're necessary for reproducing the results exactly.
Re: fmow_baseline (response to post by pfr) | Reply
The following command, using the baseline code and the fMoW-rgb dataset with val sample false_detection boxes, will generate these files for you:
python runBaseline.py -prepare


If you would like, we can release these files after the challenge ends. We don't want to release these files now as we are very close to the end of the challenge and they may provide an unfair advantage.
Re: fmow_baseline (response to post by fMoW_baseline) | Reply
Can you release them now? I would like to understand why I wasn't able to reproduce the exact baseline score.