'EPiServer Find Content Indexing Job' deletes commerce data


Using EPiServer.Find version 11.1.2.4113

This is what is logged in scheduled job history:

Indexing job [mysite] [content]: Reindexing completed. ExecutionTime: 17 minutes 32 seconds Number of contents indexed: 8909
Indexing job [Global assets and other data] [content]: Reindexing completed. ExecutionTime: 1 minutes 52 seconds Number of contents indexed: 1310 Number of content errors: 4

It consists of two parts:

  • 'mysite' reindex
  • 'Global assets and other data' reindex

Problem: while the job is running, commerce data is added to the index, and as a last step all commerce data is deleted, even though it was just added to the index.

This is how it looks in Fiddler:

For 'mysite' reindex:

a lot of:

POST http://es-eu-dev-api01.episerver.net/xxxx/_bulk HTTP/1.1

requests that index CMS data and also include commerce data

As a last step this is sent over the wire:

DELETE http://es-eu-dev-api01.episerver.net/xxxx/mysite/_query HTTP/1.1
Content-Type: application/json
User-Agent: EPiServer-Find-NET-API/11.1.2.4113
Host: es-eu-dev-api01.episerver.net
Content-Length: 330
Expect: 100-continue
Accept-Encoding: gzip, deflate

{
   "filtered":{
      "query":{
         "constant_score":{
            "filter":{
               "and":[
                  {
                     "range":{
                        "GetTimestamp$$date":{
                           "from":"0001-01-01T00:00:00Z",
                           "to":"2016-03-18T11:11:47.0955583Z",
                           "include_lower":true,
                           "include_upper":false
                        }
                     }
                  },
                  {
                     "term":{
                        "SiteId$$string":"4d260ec1-ec59-4bbf-8de4-2bf68eb15b9d"
                     }
                  }
               ]
            }
         }
      },
      "filter":{
         "term":{
            "___types":"EPiServer.Core.IContent"
         }
      }
   }
}

Thus far everything is fine: it appears that only old content for this site is removed, which is correct.

Then the second part is run: 'Global assets and other data' reindex

also a lot of:

POST http://es-eu-dev-api01.episerver.net/xxxx/_bulk HTTP/1.1

requests, some of them fail with:

HTTP/1.1 413 Request Entity Too Large

That should not be related, as only a few items are missing (and all the commerce data was indexed in the first part).

Then, as a last request, the following is sent:

DELETE http://es-eu-dev-api01.episerver.net/xxxx/mysite/_query HTTP/1.1
Content-Type: application/json
User-Agent: EPiServer-Find-NET-API/11.1.2.4113
Host: es-eu-dev-api01.episerver.net
Content-Length: 396
Expect: 100-continue
Accept-Encoding: gzip, deflate

{
   "filtered":{
      "query":{
         "constant_score":{
            "filter":{
               "and":[
                  {
                     "range":{
                        "GetTimestamp$$date":{
                           "from":"0001-01-01T00:00:00Z",
                           "to":"2016-03-18T11:29:19.5380677Z",
                           "include_lower":true,
                           "include_upper":false
                        }
                     }
                  },
                  {
                     "or":[
                        {
                           "term":{
                              "SiteId$$string":"00000000-0000-0000-0000-000000000000"
                           }
                        },
                        {
                           "not":{
                              "filter":{
                                 "exists":{
                                    "field":"SiteId$$string"
                                 }
                              }
                           }
                        }
                     ]
                  }
               ]
            }
         }
      },
      "filter":{
         "term":{
            "___types":"EPiServer.Core.IContent"
         }
      }
   }
}

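And what that does is:

It deletes everything that is IContent, has a timestamp older than the start of this run, and either has SiteId set to the empty GUID or has no SiteId property in the index. Commerce data has no SiteId, so all of it is deleted (Nodes, Products and Variations all inherit from IContent).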

Something similar was fixed in the current version: http://world.episerver.com/documentation/Release-Notes/ReleaseNote/?releaseNoteId=FIND-811

but a similar problem still exists.
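
To make the effect of the second DELETE easier to follow, here is a minimal sketch that models its filter as a plain C# predicate. It does not use the EPiServer.Find API: `IndexedDoc` and `GlobalDeleteFilter` are hypothetical names, the index is reduced to a dictionary of field names, and the `___types` filter is left out on the assumption that every document is IContent.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for an indexed document; not an EPiServer.Find type.
public sealed class IndexedDoc
{
    public string Name { get; set; }
    public DateTime Timestamp { get; set; }

    // Field names as they appear in the index, e.g. "SiteId$$string".
    public Dictionary<string, string> Fields { get; } = new Dictionary<string, string>();
}

public static class GlobalDeleteFilter
{
    private const string SiteIdField = "SiteId$$string";
    private const string EmptySiteId = "00000000-0000-0000-0000-000000000000";

    // Mirrors the "and" filter of the request:
    // old timestamp AND (SiteId is the empty GUID OR the SiteId field is missing).
    public static bool Matches(IndexedDoc doc, DateTime jobStart)
    {
        bool isOld = doc.Timestamp < jobStart; // include_upper is false
        bool hasSiteId = doc.Fields.TryGetValue(SiteIdField, out string siteId);
        return isOld && (!hasSiteId || siteId == EmptySiteId);
    }
}
```

Under these assumptions, a variation that was indexed during the first part and carries no `SiteId$$string` field matches the filter, while a page indexed at the same time with a real site ID does not.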

#146602
Mar 18, 2016 13:04
        private string siteId;

        [Ignore]
        public string SiteId
        {
            get
            {
                if (string.IsNullOrWhiteSpace(this.siteId))
                {
                    this.siteId = "5EBC1E97-CC5A-4251-A2F6-E04A05E5C4DC";
                }

                return this.siteId;
            }
            set
            {
                this.siteId = value;
            }
        }

This is my current fix (adding it to Node, Product and Variation), but the Find scheduled job should still be fixed.

It's best that SiteId matches the actual site ID, so that the content is still deleted once it is old.

#146605
Edited, Mar 18, 2016 14:01

It deletes everything that has no siteId AND a timestamp in the past (before the indexing started), in order to clear the index of items that have been removed.

I would focus on the 413 Request Entity Too Large and reduce the batch size somewhat. Look at this:

http://antecknat.se/blog/2015/02/23/convention-for-episerver-find-to-ignore-large-files/

#146680
Mar 21, 2016 14:18

OK, let's say I add:

ContentIndexer.Instance.ContentBatchSize = 10;

Then the next question: why does the 'EPiServer Find Content Indexing Job' need to index everything twice (see line 54 of the code below), and how can I get rid of that, since it doubles everything?

// Decompiled with JetBrains decompiler
// Type: EPiServer.Find.Cms.Job.IndexingJob
// Assembly: EPiServer.Find.Cms, Version=11.1.2.4113, Culture=neutral, PublicKeyToken=8fe83dea738b45b7
// MVID: 97EFDC61-6868-439D-949B-8F9FA6949EAF
// Assembly location: C:\Projects\xxx\src\xxx\Bin\EPiServer.Find.Cms.dll

using EPiServer.BaseLibrary.Scheduling;
using EPiServer.Find.Cms;
using EPiServer.Find.Cms.BestBets;
using EPiServer.PlugIn;
using EPiServer.ServiceLocation;
using EPiServer.Web;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;

namespace EPiServer.Find.Cms.Job
{
  [ScheduledPlugIn(Description = "This indexing job is used to reindex all content. During normal operation changes to content are being indexed as they are made without rerunning or scheduling of this job.", DisplayName = "EPiServer Find Content Indexing Job", LanguagePath = "/EPiServer/Find/indexingJob", SortIndex = 10100)]
  public class IndexingJob : JobBase
  {
    private static readonly object jobLock = new object();
    private bool stop;

    public Injected<EPiServer.Web.SiteDefinitionRepository> SiteDefinitionRepository { get; set; }

    public IndexingJob()
    {
      this.IsStoppable = true;
    }

    public override string Execute()
    {
      SiteDefinition current = SiteDefinition.Current;
      try
      {
        Func<SiteDefinition, string> getNameOfDefinition = (Func<SiteDefinition, string>) (sd =>
        {
          if (SiteDefinition.Empty == sd)
            return "Global assets and other data";
          return sd.Name;
        });
        if (!Monitor.TryEnter(IndexingJob.jobLock))
          throw new ApplicationException("Indexing job is already running.");
        try
        {
          string str1 = string.Empty;
          if (Enumerable.Any<SiteDefinition>(this.SiteDefinitionRepository.Service.List()))
          {
            foreach (SiteDefinition siteDefinition in Enumerable.Concat<SiteDefinition>(this.SiteDefinitionRepository.Service.List(), (IEnumerable<SiteDefinition>) new SiteDefinition[1]
            {
              SiteDefinition.Empty
            }))
            {
              SiteDefinition.Current = siteDefinition;
              this.stop = false;
              StringBuilder statusReport = new StringBuilder();
              ContentIndexer.ReIndexResult reIndexResult = ContentIndexer.Instance.ReIndex((Action<ContentIndexer.ReIndexStatus>) (s =>
              {
                if (s.IsError)
                  statusReport.AppendLine(EPiServer.Find.Helpers.Text.StringExtensions.StripHtml(s.Message));
                this.OnStatusChanged("Indexing job [" + getNameOfDefinition(SiteDefinition.Current) + "] [content]: " + EPiServer.Find.Helpers.Text.StringExtensions.StripHtml(s.Message));
              }), new Func<bool>(this.IsStopped));
              str1 = str1 + "Indexing job [" + getNameOfDefinition(SiteDefinition.Current) + "] [content]: " + EPiServer.Find.Helpers.Text.StringExtensions.StripHtml(reIndexResult.PrintReport()).Replace("\n", "<br />") + "<br />";
              if (statusReport.Length > 0)
                str1 = str1 + statusReport.ToString().Replace("\n", "<br />") + "<br />";
              string str2 = ExternalUrlBestBetHandlers.ReindexExternalUrlBestBets();
              if (str2.Length > 0)
                str1 += str2;
            }
          }
          else
            str1 += "No sites have been configured. Please go to the 'Manage Websites' section to add a site configuration.";
          return str1;
        }
        finally
        {
          Monitor.Exit(IndexingJob.jobLock);
        }
      }
      finally
      {
        SiteDefinition.Current = current;
      }
    }

    public override void Stop()
    {
      this.stop = true;
    }

    public bool IsStopped()
    {
      return this.stop;
    }
  }
}
#146705
Edited, Mar 21, 2016 17:16

No, that's rushing the conclusion...

Each parallel thread will take ten batches (but no more than 1000 items) and on line 117:

foreach (IEnumerable<IContent> source in this.Batch<IContent>((IEnumerable<IContent>) list2, this.ContentBatchSize))
                  this.IndexBatch((IEnumerable<IContent>) Enumerable.ToList<IContent>(source), statusAction, ref numberOfContentErrors, ref indexingCount);

it will send no more than ContentBatchSize items to the service at a time. That should answer your first concern.
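
For readers without a decompiler at hand, here is a minimal sketch of what a batching helper of this kind typically looks like. It is not the actual EPiServer.Find.Cms implementation; `BatchSketch.Batch` is a hypothetical name and only illustrates the chunking idea.

```csharp
using System;
using System.Collections.Generic;

public static class BatchSketch
{
    // Splits a sequence into consecutive chunks of at most batchSize items.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }

        // Final batch, possibly smaller than batchSize.
        if (batch.Count > 0)
            yield return batch;
    }
}
```

With a batch size of 10, a list of 25 items would be split into chunks of 10, 10 and 5, so no single chunk exceeds the configured size.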

Your second question relates to how it indexes site-specific data versus the global assets and other data. The items are provided by the class(es) implementing IReIndexInformation. It will first index all items under a (specific) site root. Later it will index all items that live outside of a site root; these include global assets and other data. That's why it's named so.

It should not double anything. If you have duplicates in your index, please give examples. Thanks!

#146708
Edited, Mar 21, 2016 17:40

Yes, you are right about ContentBatchSize; it is correct. I was rushing, sorry about that!

#146711
Edited, Mar 21, 2016 18:32
#146713
Edited, Mar 21, 2016 18:39

So it is true, as @ksjoberg says: commerce data that is added using IReIndexInformation is indexed only in the 'site' reindex and not in the 'Global assets and other data' reindex.

The problem here was that, as a workaround to stop the 'Global assets and other data' reindex from deleting commerce data, I had added:

        private Guid siteId;

        [Ignore]
        public Guid SiteId
        {
            get
            {
                if (this.siteId == Guid.Empty)
                {
                    this.siteId = Guid.Parse("5EBC1E97-CC5A-4251-A2F6-E04A05E5C4DC");
                }

                return this.siteId;
            }
            set
            {
                this.siteId = value;
            }
        }

Commerce data was still deleted. What I should in fact have added is:

        private string siteId;

        [Ignore]
        public string SiteId
        {
            get
            {
                if (string.IsNullOrWhiteSpace(this.siteId))
                {
                    this.siteId = "5EBC1E97-CC5A-4251-A2F6-E04A05E5C4DC";
                }

                return this.siteId;
            }
            set
            {
                this.siteId = value;
            }
        }

The difference is between string and Guid.

If it is a Guid, then this request:

DELETE http://es-eu-dev-api01.episerver.net/xxxx/mysite/_query HTTP/1.1
Content-Type: application/json
User-Agent: EPiServer-Find-NET-API/11.1.2.4113
Host: es-eu-dev-api01.episerver.net
Content-Length: 396
Expect: 100-continue
Accept-Encoding: gzip, deflate
 
{
   "filtered":{
      "query":{
         "constant_score":{
            "filter":{
               "and":[
                  {
                     "range":{
                        "GetTimestamp$$date":{
                           "from":"0001-01-01T00:00:00Z",
                           "to":"2016-03-18T11:29:19.5380677Z",
                           "include_lower":true,
                           "include_upper":false
                        }
                     }
                  },
                  {
                     "or":[
                        {
                           "term":{
                              "SiteId$$string":"00000000-0000-0000-0000-000000000000"
                           }
                        },
                        {
                           "not":{
                              "filter":{
                                 "exists":{
                                    "field":"SiteId$$string"
                                 }
                              }
                           }
                        }
                     ]
                  }
               ]
            }
         }
      },
      "filter":{
         "term":{
            "___types":"EPiServer.Core.IContent"
         }
      }
   }
}

deletes all commerce data. If it is a string, the commerce data is skipped, because the query looks at:

  • 'SiteId$$string' -> the field name in the index when the property is a string
  • and not 'SiteId' -> the field name in the index when the property is a Guid

I have also edited my original reply to use string instead of Guid.
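
The naming difference can be written down as a tiny sketch. It encodes only the two cases observed in this thread (a string property ends up as 'SiteId$$string', a Guid property as 'SiteId'); `SiteIdFieldSketch` is a hypothetical helper, not part of EPiServer.Find, and other property types are deliberately not modelled.

```csharp
using System;

public static class SiteIdFieldSketch
{
    // Index field name for the two property types discussed in this thread.
    public static string IndexFieldName(string propertyName, Type propertyType)
    {
        if (propertyType == typeof(string))
            return propertyName + "$$string";
        if (propertyType == typeof(Guid))
            return propertyName;
        throw new NotSupportedException("Only string and Guid are modelled in this sketch.");
    }

    // The delete query checks for the existence of "SiteId$$string" only.
    public static bool SeenByGlobalDelete(Type siteIdPropertyType)
    {
        return IndexFieldName("SiteId", siteIdPropertyType) == "SiteId$$string";
    }
}
```

In this model a Guid-typed SiteId is invisible to the `exists` filter, so the document is treated as having no site and gets deleted, whereas a string-typed SiteId is found.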

#146745
Edited, Mar 22, 2016 13:35

However, one question remains:

Is it normal practice that I need to add a property (string SiteId) to entities that are added by IReIndexInformation, so that they are not deleted by the 'Global assets and other data' reindex job?

#146746
Edited, Mar 22, 2016 13:44

Hi,

Let me break down the query that you posted above:

{
   "filtered":{
      "query":{
         "constant_score":{
            "filter":{
               "and":[
                  {
                     "range":{
                        "GetTimestamp$$date":{
                           "from":"0001-01-01T00:00:00Z",
                           "to":"2016-03-18T11:29:19.5380677Z",
                           "include_lower":true,
                           "include_upper":false
                        }
                     }
                  },
                  {
                     "or":[
                        {
                           "term":{
                              "SiteId$$string":"00000000-0000-0000-0000-000000000000"
                           }
                        },
                        {
                           "not":{
                              "filter":{
                                 "exists":{
                                    "field":"SiteId$$string"
                                 }
                              }
                           }
                        }
                     ]
                  }
               ]
            }
         }
      },
      "filter":{
         "term":{
            "___types":"EPiServer.Core.IContent"
         }
      }
   }
}

What it does is: it deletes all documents that were indexed before this indexing run was performed AND (have the SiteId field set to Guid.Empty OR do not have the SiteId field).

So to answer your question: No, adding that property is not required. Performing that query will only delete items that were not indexed this time around, allowing you to perform a full index without emptying the index first.

#147262
Apr 08, 2016 13:40

The EPiServer Find reindex job runs two sub-jobs:

1.) Indexing job [mysite] [content] - I will call it the first job
2.) Indexing job [Global assets and other data] [content] - I will call it the second job


1.)
The first job takes a timestamp when it starts and then reindexes all site content. That includes commerce data (whatever inherits from IContent), as it is manually added using IReIndexInformation.
Afterwards the first job runs its delete and correctly deletes only old data.


2.)
The second job takes a timestamp.
It indexes images, files and other data.
Afterwards it deletes everything that is IContent and has an old timestamp.

As commerce data is indexed in the first job, and by default it does not have a SiteId property, all the commerce data is deleted
unless that property is added to the variation, product and node entities.
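
The sequence above can be replayed with a small simulation. This is only a sketch under simplifying assumptions: timestamps are plain integers, the index is a dictionary, and `ReindexSimulation` is a hypothetical name that does not call any EPiServer.Find code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReindexSimulation
{
    private const string Site = "4d260ec1-ec59-4bbf-8de4-2bf68eb15b9d";
    private const string Global = "00000000-0000-0000-0000-000000000000";

    // Returns the names of the documents left in the index after both jobs.
    public static List<string> Run(bool commerceHasSiteId)
    {
        // Value: (SiteId field, or null when the field is absent; indexing time).
        var index = new Dictionary<string, (string SiteId, int Time)>();

        // First job: starts at t=1 and indexes site content plus commerce data at t=2.
        int firstStart = 1;
        index["page"] = (Site, 2);
        index["variation"] = (commerceHasSiteId ? Site : null, 2);
        Delete(index, d => d.Time < firstStart && d.SiteId == Site);

        // Second job: starts at t=3 and indexes global assets only, at t=4.
        int secondStart = 3;
        index["image"] = (Global, 4);
        Delete(index, d => d.Time < secondStart && (d.SiteId == null || d.SiteId == Global));

        return index.Keys.OrderBy(k => k).ToList();
    }

    private static void Delete(
        Dictionary<string, (string SiteId, int Time)> index,
        Func<(string SiteId, int Time), bool> matches)
    {
        // Collect keys first so the dictionary is not modified while enumerating.
        foreach (string key in index.Where(kv => matches(kv.Value)).Select(kv => kv.Key).ToList())
            index.Remove(key);
    }
}
```

In this model `Run(false)` leaves only the image and the page in the index, while `Run(true)` also keeps the variation, which matches the behaviour described above.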


Does this clarify it?

#147264
Edited, Apr 08, 2016 15:05

Hi,

Thank you, I think I got it. We're looking into it. It is not expected that the first (site-specific) indexing job would add items to the index that are non-site-specific, but as you are saying, it is.

#147299
Apr 11, 2016 11:22

Just to verify. Going back to your last post; you're saying that the second job doesn't index commerce data (again)?

#147311
Apr 11, 2016 17:41

Correct, it does not, and that is what I expect, as it is done by the first job (so only once).

#147312
Apr 11, 2016 18:00

I have created a bug on this (COM-1428), which should be public soon.

#147332
Apr 12, 2016 10:44